Abstract

The growing availability of network data and of scientific interest in distributed systems has led to the
rapid development of statistical models of network structure. Typically, however, these are models for the
entire network, while the data consists only of a sampled sub-network. Parameters for the whole network,
which is what is of interest, are estimated by applying the model to the sub-network. This assumes that
the model is consistent under sampling, or, in terms of the theory of stochastic processes, that it defines a
projective family. Focussing on the popular class of exponential random graph models (ERGMs), we show
that this apparently trivial condition is in fact violated by many popular and scientifically appealing models,
and that satisfying it drastically limits ERGM's expressive power. These results are actually special cases of
more general ones about exponential families of dependent random variables, which we also prove. Using
such results, we offer easily checked conditions for the consistency of maximum likelihood estimation in
ERGMs, and discuss some possible constructive responses.

Keywords: Exponential family, Projective family, Network models, Exponential random graph model,
Sufficient statistics, Independent increments, Network sampling

1 Introduction

In recent years, the rapid increase in both the availability of data on networks (of all kinds, but especially
social ones) and the demand, from many scientific areas, for analyzing such data has resulted in a surge of
generative and descriptive models for network data (Easley and Kleinberg, 2010; Newman, 2010). Within
statistics, this trend has led to a renewed interest in developing, analyzing and validating statistical models
for networks (Goldenberg et al., 2009; Kolaczyk, 2009). Yet as networks are a non-standard type of
data, many basic properties of statistical models for networks are still unknown or have not been properly
explored.

In this article we investigate the conditions under which statistical inferences drawn over a sub-network
will generalize to the entire network. It is quite rare for the data to ever actually be the whole network
of relations among a given set of nodes or units[1]; typically, only a sub-network is available. Guided by
experience of more conventional problems like regression, analysts have generally fit models to the available
sub-network, and then extrapolated them to the larger true network which is of actual scientific interest,
presuming that the models are, as it were, consistent under sampling. What we show is that this is only valid
for very special model specifications, and that the specifications where it is not valid include some which are
currently among the most popular and scientifically appealing.

* Email: cshalizi@cmu.edu
[1] This sense of the "whole network" should not be confused with the technical term "complete graph", where every vertex has a
direct edge to every other vertex.

In particular, we restrict ourselves to exponential random graph models (ERGMs), undoubtedly one of
the most important and popular classes of statistical models of network structure. In addition to the general
works already cited, the reader is referred to Frank and Strauss (1986); Wasserman and Pattison (1996);
Anderson et al. (1999); Snijders et al. (2006); Robins et al. (2007); Wasserman and Robins (2005); Handcock
et al. (2008); Park and Newman (2004b), for detailed accounts of these models. There are many reasons
ERGMs are so prominent. On the one hand, ERGMs, as the name suggests, are exponential families, and
so they inherit all the familiar virtues of exponential families in general: they are analytically and inferentially
convenient (Brown, 1986); they naturally arise from considerations of maximum entropy (Mandelbrot,
1962) and minimum description length (Grünwald, 2007), and from physically-motivated large deviations
principles (Touchette, 2009); and if a generative model obeys reasonable-seeming regularity conditions while
still having a finite-dimensional sufficient statistic, it must be an exponential family (Lauritzen, 1988)[2]. On
the other hand, ERGMs have particular virtues as models of networks. The sufficient statistics in these models
typically count the number or density of certain "motifs" or small sub-graphs, such as edges themselves,
triangles, k-cliques, stars, etc. These in turn are plausibly related to different network-growth mechanisms,
giving them a substantive interpretation. (See, e.g., Goodreau et al. (2009) as an exemplary application of
this idea, or, more briefly, §5 below.) Moreover, the important task of edge prediction is easily handled in
this framework, reducing to a conditional logistic regression (Handcock et al., 2008). Since the development
of (comparatively) computationally-efficient maximum-likelihood estimators (based on Monte Carlo sampling),
ERGMs have emerged as flexible and persuasive tools for modeling network data (Handcock et al.,
2008).

Despite all these strengths, however, ERGMs are tools with a serious weakness. As we mentioned, it
is very rare to ever observe the whole network of interest. The usual procedure, then, is to fit ERGMs
(by maximum likelihood or pseudo-likelihood) to the observed sub-network, and then extrapolate the same
model, with the same parameters, to the whole network; often this takes the form of interpreting the parameters
as "provid[ing] information about the presence of structural effects observed in the network" (Robins
et al., 2007, p. 194), or the strength of different network-formation mechanisms. (Ackland and O'Neil 2011;
Daraganova et al. 2012; de la Haye et al. 2010; Gondal 2011; Gonzalez-Bailon 2009; Schaefer 2012; Vermeij
et al. 2009 are just a few of the more recent papers doing this.) This obviously raises the question of
the statistical (i.e., large sample) consistency of maximum likelihood estimation in this context. Unnoticed,
however, is the logically prior question of whether it is probabilistically consistent to apply the same ERGM,
with the same parameters, both to the whole network and its sub-networks: that is, whether the marginal
distribution of a sub-network will be consistent with the distribution of the whole network, for all possible
values of the model parameters. The same question arises when parameters are compared across networks
of different sizes (as in, e.g., Faust and Skvoretz 2002; Goodreau et al. 2009; Lubbers and Snijders 2007).
When this form of consistency fails, the parameter estimates obtained from a sub-network may not provide
reliable estimates of, or may not even be relatable to, the parameters of the whole network, rendering
the task of statistical inference based on a sub-network ill-posed. We formalize this question using the notion
of "projective families" from the theory of stochastic processes. We say that a model is projective when the
same parameters can be used for both the whole network and any of its sub-networks. In this article, we fully
characterize projectibility of discrete exponential families and, as a corollary, show that ERGMs are projective
only for very special choices of the sufficient statistic.

Outline Our results are not specific just to networks, but pertain more generally to exponential families
of stochastic processes. §2 therefore lays out the necessary background about projective families of distributions,
projective parameters, and exponential families in a somewhat more abstract setting than that of
networks. §3 shows that a necessary and sufficient condition for an exponential family to be projective is
that the sufficient statistics obey a kind of additive decomposition. This in turn implies strong independence
properties. We also prove results about the consistency of maximum likelihood parameter estimation under
these conditions (§4). In §5, we apply these results to ERGMs, showing that most popular specifications for
social networks and other stochastic graphs cannot be projective. We then conclude with some discussion of
possible constructive responses. The proofs are contained in §??.

[2] Mandelbrot (1962) is still one of the best discussions of the interplay between the formal, statistical and substantive motivations
for using exponential families.

Related work An early recognition of the fact that sub-networks may have statistical properties which 
differ radically from those of the whole network came in the context of studying networks with power-law 
("scale-free") degree distributions. On the one hand, Stumpf et al. (2005) showed that "subnets of scale-free 
networks are not scale-free"; on the other, Achlioptas et al. (2005) demonstrated that a particular, highly 
popular sampling scheme creates the appearance of a power-law degree distribution on nearly any network. 
While the importance of network sampling schemes has been recognized since then (Kolaczyk, 2009, ch. 
5), and valuable contributions have come from, e.g., Kossinets (2006); Handcock and Gile (2010); Krivitsky 
et al. (2011); Ahmed et al. (2010), we are not aware of any work which has addressed the specific issue 
of consistency under projection which we tackle here. Perhaps the closest approaches to our perspective 
are Orbanz (2011) and Xiang and Neville (2011). The former considers conditions under which infinite-dimensional
families of distributions on abstract spaces have projective limits. The latter, more concretely,
addresses the consistency of maximum likelihood estimators for exponential families of dependent variables,
but under assumptions (regarding Markov properties, the "shape" of neighborhoods, and decay of correlations
in potential functions) which are basically incomparable in strength to ours.

2 Projective Statistical Models and Exponential Families 

Our results about exponential random graph models are actually special cases of more general results about 
exponential families of dependent random variables, and are just as easy to state and prove in the general 
context as for graphs. Setting this up, however, requires some preliminary definitions and notation, which 
make precise the idea of "seeing more data from the same source". To spare ourselves any
measurability issues we will implicitly assume the existence of an underlying probability measure for which
the random variables under study are all measurable. Furthermore, for the sake of readability we will not
rely on the measure-theoretic notion of filtration: though technically appropriate, it would add nothing to our
results.

Let $\mathcal{A}$ be a collection of finite subsets of a denumerable set $\mathcal{I}$, partially ordered with respect to subset
inclusion. For technical reasons, we will further assume that $\mathcal{A}$ has the property of being an ideal: i.e., if $A$
belongs to $\mathcal{A}$ then all subsets of $A$ are also in $\mathcal{A}$, and if $A$ and $B$ belong to $\mathcal{A}$, then so does their union. We
may think of passing from $A$ to $B \supset A$ as taking increasingly large samples from a population, or recording
increasingly long time series, or mapping data from increasingly large spatial regions, or over an increasingly
dense spatial grid, or looking at larger and larger sub-graphs from a single network. Accordingly, we consider
the associated collection of parametric statistical models $\{\mathcal{P}_{A,\Theta}\}_{A \in \mathcal{A}}$ indexed by $\mathcal{A}$, where, for each $A \in \mathcal{A}$,
$\mathcal{P}_{A,\Theta} = \{\mathbb{P}_{A,\theta}\}_{\theta \in \Theta}$ is a family of probability distributions indexed by points $\theta$ in a fixed open set $\Theta \subseteq \mathbb{R}^d$. The
probability distributions in $\mathcal{P}_{A,\Theta}$ are also assumed to be supported over the same $\mathcal{X}_A$, which are countable[3]
sets for each $A$. We assume that the partial order of $\mathcal{A}$ is isomorphic to the partial order over $\{\mathcal{X}_A\}_{A \in \mathcal{A}}$, in
the sense that $A \subset B$ if and only if $\mathcal{X}_B = \mathcal{X}_A \times \mathcal{X}_{B \setminus A}$.

For given $\theta$ and $A$, we denote with $X_A$ the random variable distributed as $\mathbb{P}_{A,\theta}$. In particular, for a given
$\theta \in \Theta$, we can regard the $\{\mathbb{P}_{A,\theta}\}_{A \in \mathcal{A}}$ as finite dimensional (i.e. marginal) distributions.

For each pair $A, B$ in $\mathcal{A}$ with $A \subset B$, we let $\pi_{B \mapsto A} : \mathcal{X}_B \to \mathcal{X}_A$ be the natural index projection given by
$\pi_{B \mapsto A}(x_A, x_{B \setminus A}) = x_A$. In the context of networks, we may think of $\mathcal{I}$ as the set of nodes of a possibly infinite
random graph, which without loss of generality can be taken to be $\{1, 2, \ldots\}$, and of $\mathcal{A}$ as the collection of
all finite subsets of $\mathcal{I}$. Then, for some positive integers $n$ and $m$, we may, for instance, take $A = \{1, \ldots, n\}$
and $B = \{1, \ldots, n, \ldots, n + m\}$, so that $X_A$ will be the induced sub-graph on the first $n$ nodes and $X_B$ the
induced sub-graph on the first $n + m$ nodes. The projection $\pi_{B \mapsto A}$ then just picks out the appropriate
sub-graph from the larger graph (see Figure 1 for a schematic example). We will be concerned with a natural
form of probabilistic consistency of the collection $\{\mathcal{P}_{A,\Theta}\}_{A \in \mathcal{A}}$ which we call projectibility, defined below.

[3] Our results extend to continuous observations straightforwardly, but with annoying notational overhead.

Figure 1: Projective structure for networks: when the set of observables $A$ is contained in the larger set
of observables $B$, $X_A$ (on the right) can be recovered from $X_B$ (on the left) through the projection $\pi_{B \mapsto A}$,
which simply drops the extra data.

Definition 2.1. The family $\{\mathcal{P}_{A,\Theta}\}_{A \in \mathcal{A}}$ is projective if, for any $A$ and $B$ in $\mathcal{A}$ with $A \subset B$,

$$\mathbb{P}_{A,\theta} = \mathbb{P}_{B,\theta} \circ \pi^{-1}_{B \mapsto A}. \qquad (1)$$

See Kallenberg (2002, p. 115) for a more general treatment of projectibility. In words, $\{\mathcal{P}_{A,\Theta}\}_{A \in \mathcal{A}}$ is a
projective family when $A \subset B$ implies that $\mathbb{P}_{A,\theta}$ can be recovered by marginalization over $\mathbb{P}_{B,\theta}$, for all $\theta$.
(Figure 2 illustrates.) Within a projective family, $\mathbb{P}_\theta$ denotes the infinite-dimensional distribution, which thus
exists by the Kolmogorov extension theorem (Kallenberg, 2002, Thm. 6.16, p. 115).

Projectibility is automatic when the generative model calls for independent and identically distributed 
(IID) observations. It is also generally unproblematic when the model is specified in terms of conditional
distributions: one then just uses the Ionescu Tulcea extension theorem in place of that of Kolmogorov
(Kallenberg, 2002, Thm. 6.17, p. 116). However, many models are specified in terms of joint distributions for
various index sets, and this, as we show in Theorem 3.2, can rule out projectibility. 

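We restrict ourselves to exponential family models by assuming that, for each choice of $\theta \in \Theta$ and
$A \in \mathcal{A}$, $\mathbb{P}_{A,\theta}$ has density with respect to the counting measure over $\mathcal{X}_A$ given by

$$p_{A,\theta}(x) = \frac{e^{\langle \theta, t_A(x) \rangle}}{z_A(\theta)}, \quad x \in \mathcal{X}_A, \qquad (2)$$

where $t_A : \mathcal{X}_A \to \mathbb{R}^d$ is the measurable function of minimal sufficient statistics, and $z_A : \Theta \to \mathbb{R}$ is the
partition function given by

$$z_A(\theta) = \sum_{x \in \mathcal{X}_A} e^{\langle \theta, t_A(x) \rangle}. \qquad (3)$$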

If $X_A \sim \mathbb{P}_{A,\theta}$, we will write $T_A = t_A(X_A)$ for the random variable corresponding to the sufficient statistic.
Equation (2) implies that $T_A$ itself has an exponential family distribution, with the same parameter $\theta$ and
partition function $z_A(\theta)$ (Brown, 1986, Prop. 1.5). Specifically, the distribution function is

$$\mathbb{P}_{A,\theta}(T_A = t) = \frac{e^{\langle \theta, t \rangle} v_A(t)}{z_A(\theta)}, \qquad (4)$$

where the term $v_A(t) = |\{x \in \mathcal{X}_A : t_A(x) = t\}|$, which we will call the volume factor, counts the number of
points in $\mathcal{X}_A$ with the same sufficient statistics $t$. The moment generating function of $T_A$ is

$$M_{\theta,A}(\phi) = \mathbb{E}_\theta\left[e^{\langle \phi, T_A \rangle}\right] = z_A(\theta + \phi)/z_A(\theta). \qquad (5)$$
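The volume factor, the distribution of the sufficient statistic in equation (4), and the moment generating function identity in equation (5) can all be checked by brute force on a small example. The following is a minimal sketch, assuming a scalar statistic; the toy model (four binary observations with the count of ones as the statistic) and all function names are illustrative choices, not taken from the paper.

```python
import math
from collections import Counter
from itertools import product

def volume_factor(space, t):
    """v_A(t): number of configurations x in the space sharing each value of t(x)."""
    return Counter(t(x) for x in space)

def partition_function(theta, vol):
    """z_A(theta) = sum over t of v_A(t) * exp(theta * t), for a scalar statistic."""
    return sum(v * math.exp(theta * t) for t, v in vol.items())

def statistic_pmf(theta, vol):
    """Equation (4): P(T_A = t) = exp(theta * t) * v_A(t) / z_A(theta)."""
    z = partition_function(theta, vol)
    return {t: v * math.exp(theta * t) / z for t, v in vol.items()}

# Hypothetical toy model: 4 binary observations, t(x) = number of ones.
space = list(product((0, 1), repeat=4))
vol = volume_factor(space, sum)
theta, phi = 0.3, 0.7
pmf = statistic_pmf(theta, vol)

# Equation (5): E[exp(phi * T)] should equal z(theta + phi) / z(theta).
mgf_direct = sum(p * math.exp(phi * t) for t, p in pmf.items())
mgf_ratio = partition_function(theta + phi, vol) / partition_function(theta, vol)
```

Here the volume factors are just binomial coefficients, and `mgf_direct` and `mgf_ratio` agree up to floating-point error.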
Figure 2: Illustration of projectibility: the probability of a small configuration over $A$ (left), calculated
according to $\mathbb{P}_{A,\theta}$, must match the sum of all larger configurations over $B$ containing it (right), calculated
according to $\mathbb{P}_{B,\theta}$.

If the sufficient statistic is completely additive, i.e., if $t_A(x_A) = \sum_{i \in A} t_{\{i\}}(x_i)$, then this is a model of
independent (if not necessarily IID) data. In general, however, the choice of sufficient statistics may impose,
or capture, dependence between observations.
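To spell out why complete additivity gives independence, here is a short derivation using only equations (2) and (3). The per-index partition functions $z_{\{i\}}$ are notation introduced for this sketch, and we use the assumption that $\mathcal{X}_A$ is the product of the per-index spaces $\mathcal{X}_{\{i\}}$.

```latex
\begin{align*}
z_A(\theta)
  &= \sum_{x_A \in \mathcal{X}_A} \prod_{i \in A} e^{\langle \theta, t_{\{i\}}(x_i) \rangle}
   = \prod_{i \in A} \sum_{x_i \in \mathcal{X}_{\{i\}}} e^{\langle \theta, t_{\{i\}}(x_i) \rangle}
   = \prod_{i \in A} z_{\{i\}}(\theta), \\
p_{A,\theta}(x_A)
  &= \prod_{i \in A} \frac{e^{\langle \theta, t_{\{i\}}(x_i) \rangle}}{z_{\{i\}}(\theta)} .
\end{align*}
```

The joint density is a product of per-index densities, so the $X_i$ are independent; they are identically distributed only if the functions $t_{\{i\}}$ and spaces $\mathcal{X}_{\{i\}}$ coincide across $i$.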

Because we are considering exponential families defined on increasingly large sets of observations, it
is convenient to introduce some notation related to multiple statistics. Fix $A, B \in \mathcal{A}$ such that $A \subset B$.
Then $t_B : \mathcal{X}_B \to \mathbb{R}^d$, and we will sometimes write this function $t_B(x, y)$, where the first argument is in
$\mathcal{X}_A$ and the second in $\mathcal{X}_{B \setminus A}$. We will have frequent recourse to the increment to the sufficient statistic,
$t_{B \setminus A}(x, y) = t_B(x, y) - t_A(x)$. The volume factor $v_B(t_B(x_B))$ is defined as before, but we shall also consider,
for each observable value $t$ of the sufficient statistics for $A$ and increment $\delta$ of the sufficient statistics from $A$
to $B$, the joint volume factor,

$$v_{A, B \setminus A}(t, \delta) = |\{(x, y) \in \mathcal{X}_B : t_A(x) = t \text{ and } t_{B \setminus A}(x, y) = \delta\}|, \qquad (6)$$

and the conditional volume factor,

$$v_{B \setminus A | A}(\delta, x) = |\{y \in \mathcal{X}_{B \setminus A} : t_{B \setminus A}(x, y) = \delta\}|. \qquad (7)$$

As we will see, these volume factors play a key role in characterizing projectibility.

3 Projective Structure in Exponential Families 

In this section we characterize projectibility in terms of the increments of the vector of sufficient statistics. In 
particular we show that exponential families are projective if, and only if, their sufficient statistics decompose 
into separate additive contributions from disjoint observations in a particularly nice way which we formalize 
in the following definition. 

Definition 3.1. The sufficient statistics of the family $\{\mathcal{P}_{A,\Theta}\}_{A \in \mathcal{A}}$ have separable increments when, for each
$A \subset B$, $x \in \mathcal{X}_A$, the range of possible increments $\delta$ is the same for all $x$, and the conditional volume factor
is constant in $x$, i.e. $v_{B \setminus A | A}(\delta, x) = v_{B \setminus A}(\delta)$.
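For finite sample spaces, Definition 3.1 can be checked mechanically by tabulating the conditional volume factor of equation (7) for every $x$. The sketch below assumes scalar statistics and a single pair $A \subset B$; the function names and both toy statistics (a count of ones, and its square) are hypothetical illustrations rather than examples from the paper.

```python
from collections import Counter
from itertools import product

def conditional_volume_factors(space_a, space_rest, t_a, t_b):
    """Map each x in X_A to a Counter {delta: number of y giving that increment}."""
    return {
        x: Counter(t_b(x, y) - t_a(x) for y in space_rest)
        for x in space_a
    }

def has_separable_increments(space_a, space_rest, t_a, t_b):
    """True when the conditional volume factor does not depend on x.

    Comparing whole Counters checks both the range of increments and their counts.
    """
    factors = list(conditional_volume_factors(space_a, space_rest, t_a, t_b).values())
    return all(f == factors[0] for f in factors)

# Hypothetical toy setting: A holds two binary observations, B adds one more.
space_a = list(product((0, 1), repeat=2))
space_rest = [(0,), (1,)]

# Additive statistic (count of ones): the increment cannot depend on x.
count_a = lambda x: sum(x)
count_b = lambda x, y: sum(x) + sum(y)

# Non-additive statistic (squared count): the increment depends on sum(x).
square_a = lambda x: sum(x) ** 2
square_b = lambda x, y: (sum(x) + sum(y)) ** 2

separable_count = has_separable_increments(space_a, space_rest, count_a, count_b)
separable_square = has_separable_increments(space_a, space_rest, square_a, square_b)
```

The additive statistic passes the check, while the squared count fails because the set of reachable increments changes with $x$.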

It is worth noting that the property of having separable increments is an intrinsic property of the family
$\{\mathcal{P}_{A,\Theta}\}_{A \in \mathcal{A}}$ that depends only on the functional forms of the sufficient statistics $\{t_A\}_{A \in \mathcal{A}}$ and not on the
model parameters $\theta \in \Theta$. This follows from the fact that, for any $A$, the probability distributions $\{\mathbb{P}_{A,\theta}\}_{\theta \in \Theta}$
have identical support $\mathcal{X}_A$. Thus, this property holds for all values of $\theta$ or for none of them.

The main result of this paper is then as follows. 

Theorem 3.2. The exponential family $\{\mathcal{P}_{A,\Theta}\}_{A \in \mathcal{A}}$ is projective if and only if the sufficient statistics $\{T_A\}_{A \in \mathcal{A}}$
have separable increments.
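As a numerical illustration of the theorem in the network setting, one can compare $\mathbb{P}_{A,\theta}$ with the marginal of $\mathbb{P}_{B,\theta}$ by enumerating all graphs on three and four nodes. This is a brute-force sketch under our own assumptions: the two specifications (edge count alone, and edge count plus triangle count), the parameter values, and all function names are illustrative choices.

```python
import math
from itertools import combinations, product

def dyads(nodes):
    return list(combinations(nodes, 2))

def edge_count(g):
    return sum(g.values())

def triangle_count(g, nodes):
    return sum(
        g[(i, j)] * g[(i, k)] * g[(j, k)]
        for i, j, k in combinations(nodes, 3)
    )

def graphs(dyad_list):
    """All graphs on the given dyads, each as a dict dyad -> 0/1."""
    for bits in product((0, 1), repeat=len(dyad_list)):
        yield dict(zip(dyad_list, bits))

def pmf(nodes, theta, stat):
    """Equation (2) by brute force: list of (graph, probability) pairs."""
    gs = list(graphs(dyads(nodes)))
    weights = [math.exp(sum(th * s for th, s in zip(theta, stat(g, nodes)))) for g in gs]
    z = sum(weights)
    return [(g, w / z) for g, w in zip(gs, weights)]

def max_projection_gap(small, large, theta, stat):
    """Largest |P_A(x) - sum_y P_B(x, y)| over sub-graphs x on the small node set."""
    small_dyads = dyads(small)
    marginal = {}
    for g, p in pmf(large, theta, stat):
        key = tuple(g[d] for d in small_dyads)
        marginal[key] = marginal.get(key, 0.0) + p
    return max(
        abs(p - marginal[tuple(g[d] for d in small_dyads)])
        for g, p in pmf(small, theta, stat)
    )

edges_only = lambda g, nodes: (edge_count(g),)
edges_and_triangles = lambda g, nodes: (edge_count(g), triangle_count(g, nodes))

A, B = (1, 2, 3), (1, 2, 3, 4)
gap_edges = max_projection_gap(A, B, (0.5,), edges_only)
gap_triangles = max_projection_gap(A, B, (0.5, 1.0), edges_and_triangles)
```

With edges alone the gap is zero up to rounding, since each new dyad adds 0 or 1 to the statistic regardless of the existing graph. Adding the triangle count makes the increment depend on which edges are already present, and the gap becomes clearly positive.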

3.1 Independence Properties 

Because projectibility implies separable increments, it also carries statistical-independence implications.
Specifically, it implies that the increments to the sufficient statistics are statistically independent, and that
$X_{B \setminus A}$ and $X_A$ are conditionally independent given increments to the sufficient statistic. Interestingly,
independent increments for the statistic are necessary but not quite sufficient for projectibility. These claims are
all made more specific in the propositions which follow.

We first show that projectibility implies that the sufficient statistics have independent increments. In fact,
a stronger result holds, namely that the increments of the sufficient statistics are independent of the actual
sequence. Below we will write $T_{B \setminus A}$ to signify $T_B - T_A$.

Proposition 3.3. If the exponential family $\{\mathcal{P}_{A,\Theta}\}_{A \in \mathcal{A}}$ is projective, then the sufficient statistics $\{T_A\}_{A \in \mathcal{A}}$ have
independent increments, i.e. $A \subset B$ implies that $T_B - T_A \perp\!\!\!\perp T_A$ under all $\theta$.

Proposition 3.4. In a projective exponential family, $T_{B \setminus A} \perp\!\!\!\perp X_A$.
We note that independent increments for the sufficient statistics $T_A$ in no way imply independence of
the actual observations $X_A$. As a simple illustration, take the one-dimensional Ising model[4], where $\mathcal{I} = \mathbb{N}$,
each $X_i = \pm 1$, $\mathcal{A}$ consists of all intervals from 1 to $n$, and the single sufficient statistic is $T_{1:n} = \sum_{i=1}^{n-1} X_i X_{i+1}$.
Clearly, $T_{1:(n+1)} - T_{1:n} = +1$ when $X_n = X_{n+1}$, otherwise $T_{1:(n+1)} - T_{1:n} = -1$. Since $v_{1:(n+1)|1:n}(+1, x) =
v_{1:(n+1)|1:n}(-1, x) = 1$ for every $x$, by Theorem 3.2 the model is projective. By Proposition 3.3, then, increments of $T$
should be independent, and direct calculation shows the probability of increasing the sufficient statistic by 1
is $e^{\theta}/(e^{\theta} + e^{-\theta})$, no matter what $X_1, \ldots, X_n$ are. While the sufficient statistic has independent increments, the
random variables $X_i$ are all dependent on one another.[5]
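The claims about the Ising chain are easy to confirm by enumeration for small $n$. The sketch below assumes the free-boundary chain with $\pm 1$ spins described above; the function names and the particular values of $n$ and $\theta$ are our own choices. For this coding, the probability of a $+1$ increment works out to $e^{\theta}/(e^{\theta} + e^{-\theta})$.

```python
import math
from itertools import product

def chain_stat(x):
    """T_{1:n} = sum_{i=1}^{n-1} x_i * x_{i+1} for spins x_i in {-1, +1}."""
    return sum(a * b for a, b in zip(x, x[1:]))

def chain_pmf(n, theta):
    """Free-boundary Ising chain on n spins, by brute-force enumeration."""
    configs = list(product((-1, 1), repeat=n))
    weights = [math.exp(theta * chain_stat(x)) for x in configs]
    z = sum(weights)
    return dict(zip(configs, (w / z for w in weights)))

n, theta = 4, 0.8
p_small = chain_pmf(n, theta)
p_large = chain_pmf(n + 1, theta)

# Projectibility: summing out the last spin of the larger chain recovers the smaller one.
projection_gap = max(
    abs(p_small[x] - sum(p_large[x + (s,)] for s in (-1, 1)))
    for x in p_small
)

# Probability that the statistic increases by 1, conditional on each prefix x.
# The statistic goes up exactly when the new spin copies the last one, s = x[-1].
up_probs = [p_large[x + (x[-1],)] / p_small[x] for x in p_small]
expected_up = math.exp(theta) / (math.exp(theta) + math.exp(-theta))
```

Every entry of `up_probs` equals `expected_up`, whatever the prefix, which is the independence of the increment from $X_1, \ldots, X_n$; the spins themselves remain dependent because each one tends to copy its neighbor.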

The previous results provide a way, and often a simple one, for checking whether projectibility fails: if 
the sufficient statistics do not have independent increments, then the family is not projective. As we will see, 
this test covers many statistical models for networks. 

It is natural to inquire into the converse to these propositions. It is fairly straightforward (if somewhat 
lengthy) to show that independent increments for the sufficient statistics implies that the joint volume factor 
separates. 

Proposition 3.5. If an exponential family has independent increments, $T_{B \setminus A} \perp\!\!\!\perp T_A$, then its joint volume factor
separates, $v_{A, B \setminus A}(t, \delta) = v_A(t) v_{B \setminus A}(\delta)$, and the distribution of $T$ is projective.

However, independent increments for the sufficient statistics do not imply separable increments
(hence projectibility), as shown by the next counter-example. Hence independent increments are a necessary
but not sufficient condition for projectibility.

Suppose that $\mathcal{X}_A = \{a, b, c, d\}$, and $\mathcal{X}_{B \setminus A} = \{i, ii, iii, iv, v\}$. (Thus there are 20 possible values for $X_B$.)
Let

$$t_A(a) = t_A(b) = +1, \qquad t_A(c) = t_A(d) = -1,$$

so that $v_A(+1) = v_A(-1) = 2$. Further, let

$$\begin{aligned}
t_B(a, i) = t_B(a, ii) &= 2 \\
t_B(a, iii) = t_B(a, iv) = t_B(a, v) &= 0 \\
t_B(b, i) = t_B(b, ii) &= 0 \\
t_B(b, iii) = t_B(b, iv) = t_B(b, v) &= 2 \\
t_B(c, y) &= t_B(a, y) - 2 \\
t_B(d, y) &= t_B(b, y) - 2
\end{aligned}$$

It is not hard to verify that $T_{B \setminus A}$ is always either $+1$ or $-1$. It is also straightforward to check that
$v_{A, B \setminus A}(t, \delta) = 5$ for all combinations of $t$ and $\delta$, implying that $v_{B \setminus A}(+1) = v_{B \setminus A}(-1) = 2.5$, and that
the joint volume factor separates. On the other hand, the conditional volume factors are not constant in $x$,
as $v_{B \setminus A | A}(+1, a) = 2$ while $v_{B \setminus A | A}(+1, b) = 3$. Thus, the sufficient statistic has independent increments, but
does not have separable increments. Since projective families have separable increments (Proposition 7.1),
this cannot be a projective family. (This can also be checked by a direct and straightforward, if even more
tedious, calculation.)
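The "tedious calculation" is short when done by machine. The sketch below encodes the counter-example with the values of $t_A$ and $t_B$ given above and tabulates the joint and conditional volume factors; the variable and function names are ours, and the marginalization check uses a scalar $\theta$ chosen arbitrarily.

```python
import math
from collections import Counter

X_A = ["a", "b", "c", "d"]
X_REST = ["i", "ii", "iii", "iv", "v"]

t_A = {"a": 1, "b": 1, "c": -1, "d": -1}

# t_B for a and b as in the text; c and d are shifted copies of a and b.
t_B = {}
for y in X_REST:
    t_B["a", y] = 2 if y in ("i", "ii") else 0
    t_B["b", y] = 0 if y in ("i", "ii") else 2
    t_B["c", y] = t_B["a", y] - 2
    t_B["d", y] = t_B["b", y] - 2

def increment(x, y):
    return t_B[x, y] - t_A[x]

# Equation (6): joint volume factor, keyed by (t, delta).
joint_volume = Counter((t_A[x], increment(x, y)) for x in X_A for y in X_REST)
# Equation (7): conditional volume factor, one Counter over delta per x.
conditional_volume = {x: Counter(increment(x, y) for y in X_REST) for x in X_A}

def marginal_gap(theta):
    """Largest |P_A(x) - sum_y P_B(x, y)| at a given scalar theta."""
    z_a = sum(math.exp(theta * t) for t in t_A.values())
    z_b = sum(math.exp(theta * t) for t in t_B.values())
    return max(
        abs(math.exp(theta * t_A[x]) / z_a
            - sum(math.exp(theta * t_B[x, y]) for y in X_REST) / z_b)
        for x in X_A
    )
```

All four entries of `joint_volume` equal 5, while `conditional_volume["a"][1]` is 2 and `conditional_volume["b"][1]` is 3. Consistently with the failure of separability, `marginal_gap(1.0)` is well away from zero; at $\theta = 0$ both distributions are uniform and the gap vanishes.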

We conclude this section with a final observation. Butler (1986) showed that when observations follow
from an IID model with a minimal sufficient statistic, the predictive distribution for the next observation can
be written entirely in terms of how different hypothetical values would change the sufficient statistic. (Cf.
Lauritzen 1974; Besag 1989.) This predictive sufficiency property carries over to our setting.

[4] Technically, with "free" boundary conditions; see Landau and Lifshitz (1980).

[5] Note that while this is a graphical model, it is not a model of a random graph. (The graph is rather the one-dimensional lattice.)
Rather, it is used here merely to exemplify the general result about exponential families. We turn to exponential random graph models
in §5.
Figure 3: Relations among the main properties of models considered in §3. Probabilistic properties of the
models are on the right, algebraic/combinatorial properties of the sufficient statistic are on the left.

Theorem 3.6 (Predictive Sufficiency). In a projective exponential family, the distribution of $X_{B \setminus A}$ conditional
on $X_A$ depends on the data only through $T_{B \setminus A}$.
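A short calculation makes the form of this predictive distribution explicit. It is a sketch using only equation (2), the definition of the increment $t_{B \setminus A}$, and the two equivalent properties in Theorem 3.2; it is not the proof given in the paper.

```latex
\begin{align*}
\mathbb{P}_\theta(X_{B \setminus A} = y \mid X_A = x)
  &= \frac{p_{B,\theta}(x, y)}{p_{A,\theta}(x)}
   = \frac{e^{\langle \theta, t_B(x, y) \rangle} / z_B(\theta)}
          {e^{\langle \theta, t_A(x) \rangle} / z_A(\theta)}
   = \frac{z_A(\theta)}{z_B(\theta)} \, e^{\langle \theta, t_{B \setminus A}(x, y) \rangle}, \\
\frac{z_B(\theta)}{z_A(\theta)}
  &= \sum_{\delta} v_{B \setminus A}(\delta) \, e^{\langle \theta, \delta \rangle} .
\end{align*}
```

The first line uses projectibility, so that the marginal of $X_A$ under $\mathbb{P}_{B,\theta}$ is $p_{A,\theta}$. The second follows by summing the first over $y$ and grouping terms by increment, which is legitimate because the conditional volume factor does not depend on $x$. The data therefore enter only through the increment $t_{B \setminus A}(x, y)$.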

The main implications among our results are summarized in Figure 3.

3.2 Remarks, Applications and Extensions

Exponential families of time series As the example of the Ising model in §3.1 makes clear, our theorem
applies whenever we need an exponential family to be projective, not just when the data are networks.
In particular, it applies to exponential families of time series, where $\mathcal{I}$ is the natural or real number line
(or perhaps just its positive part), and the elements of $\mathcal{A}$ are intervals. An exponential family of stochastic
processes on such a space has projective parameters if, and only if, its sufficient statistics have separable
increments, and so only if they have independent increments.

Transformation of parameters Keeping the dimension of $\theta$ fixed, but allowing its components to change
along with $A$, does not really escape these results. Specifically, if $\theta$ is to be re-scaled in a way that is a
function of $A$ alone, we can recover the case of a fixed $\theta$ by "moving the scaling across the inner product",
i.e., by re-defining $T_A$ to incorporate the scaling. With a sample-invariant $\theta$, it is this transformed $T$ which
must have separable increments. Other transformations can either be dealt with similarly, or amount to
using a non-uniform base measure (see below).
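For the simplest case, a scalar re-scaling, the step can be written out in one line. The factor $c_A$ and the transformed statistic $\tilde{t}_A$ are notation assumed here for illustration; they do not appear in the paper.

```latex
\[
\langle c_A \theta, \, t_A(x) \rangle
  = \langle \theta, \, c_A \, t_A(x) \rangle
  = \langle \theta, \, \tilde{t}_A(x) \rangle,
\qquad \tilde{t}_A(x) := c_A \, t_A(x) .
\]
```

A family with parameter $c_A \theta$ and statistic $t_A$ is thus the same family as one with the fixed parameter $\theta$ and statistic $\tilde{t}_A$, and the separable-increments condition must be checked on $\tilde{t}_A$.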

Statistical-mechanical interpretation It is interesting to consider the interpretation of our theorem, and
of its proof, in terms of statistical mechanics. As is well-known, the "canonical" distributions in statistical
mechanics are exponential families (Boltzmann-Gibbs distributions), where the sufficient statistics are
"extensive" physical observables, such as energy, volume, the number of molecules of various species, etc., and
the natural parameters are the corresponding conjugate "intensive" variables, such as, respectively, (inverse)
temperature, pressure, chemical potential, etc. (Landau and Lifshitz, 1980; Mandelbrot, 1962). Equilibrium
between two systems which interact by exchanging the quantities tracked by the extensive variables obtains if
and only if they have the same values of the intensive parameters (Landau and Lifshitz, 1980). In our terms,
of course, this is simply projectibility, the requirement that the same parameters hold for all sub-systems.
What we have shown is that for this to be true, the increments to the extensive variables must be completely
unpredictable from their values on the sub-system.

Furthermore, notice the important role played in both halves of the proof by the separation of the joint
volume factor, $v_{A, B \setminus A}(t, \delta) = v_A(t) v_{B \setminus A}(\delta)$. In terms of statistical mechanics, a macroscopic state is a
collection of microscopic configurations with the same value of one or more macroscopic observables. The
Boltzmann entropy of a macroscopic state is (proportional to) the logarithm of the volume of those
microscopic states (Landau and Lifshitz, 1980). If we define our macroscopic states through the sufficient statistics,
then their Boltzmann entropy is just $\log v$. Thus, the separation of the volume factor is the same as the
additivity of the entropy across different parts of the system, i.e., the entropy is "extensive". Our results may
thus be relevant to debates in statistical mechanics about the appropriateness of alternative, non-extensive
entropies (cf. Nauenberg 2003).

Beyond exponential families It is not clear just how important it is that we have an exponential family, 
as opposed to a family admitting a finite-dimensional sufficient statistic. As is well-known, the two concepts 
coincide under some regularity conditions (Barndorff-Nielsen, 1978), but not quite strictly, and it would 
be interesting to know whether or not the exponential form of equation (2) is strictly required. We have 
attempted to write the proofs in a way which minimizes the use of this form (in favor of the Neyman 
factorization, which only uses sufficiency), but have not succeeded in eliminating it completely. We return 
to this matter in the conclusion. 

Prediction We have focused on the implications of projectibility for parametric inference. Exponential 
families are however often used in statistics and machine learning as generative models in applications where 
the only goal is prediction (Wainwright and Jordan, 2008), and so (to quote Butler 1986) "all parameters 
are nuisance parameters". But even then, it must be possible to consistently extend the generative model's
distribution for the training data to a joint distribution for training and testing data, with a single set of 
parameters shared by both old and new data. While this requirement may seem too trivial to mention, it is, 
precisely, projectibility. 

Growing Number of Parameters In the proof of Theorem 3.2, we used the fact that $T_A$, and hence $\theta$, has
the same dimension for all $A \in \mathcal{A}$. There are, however, important classes of models where the number of
parameters is allowed to grow with the size of the sample. Particularly important, for networks, are models
where each node is allowed a parameter (or two) of its own, such as its expected degree; see for instance
the classic $p_1$ model of Holland and Leinhardt (1981), or the "degree-corrected block models" of Karrer and
Newman (2011). We can formally extend Theorem 3.2 to cover some of these cases, including those two
particular specifications, as follows.

Assume that $T_A$ has a dimension which is non-decreasing as $A$ grows, i.e., $d_A \le d_B$ whenever
$A \subset B$. Furthermore, assume that the set of parameters $\theta_A$ only grows, and that the meaning of the old
parameters is not disturbed. That is, under projectibility we should have

$$\mathbb{P}_{B, \theta_B} \circ \pi^{-1}_{B \mapsto A} = \mathbb{P}_{A, \pi_{d_B \mapsto d_A} \theta_B}(\cdot) \qquad (8)$$

For any fixed pair $A \subset B$, we can accommodate this within the proof of Theorem 3.2 by re-defining $T_A$ to
be a mapping from $\mathcal{X}_A$ to $\mathbb{R}^{d_B}$, where the extra $d_B - d_A$ components of the vector are always zero. The
extra parameters in $\theta_B$ then have no influence on the distribution of $X_A$ and are unidentified on $A$, but we
have, formally, restored the fixed-parameter case. The "increments" of the extra components of $T_B$ are then
simply their values on $\mathcal{X}_B$, and, by the theorem, the range of values for these statistics, and the number of
configurations on $\mathcal{X}_{B \setminus A}$ leading to each value, must be equal for all $x \in \mathcal{X}_A$.

Adapting our conditions for the asymptotic convergence of maximum likelihood estimators (§4) to the 
growing-parameter setting is beyond our scope here. 
Non-uniform base measures If the exponential densities in (2) are defined with respect to non-uniform
base measures different from the counting measures, the sufficient statistics need not have separable
increments. In Appendix A we address this issue and describe the modifications and additional assumptions
required for our analysis to remain valid. We thank an anonymous referee and Pavel Krivitsky for
independently bringing this subtle point to our attention.



4 Consistency of Maximum Likelihood Estimators 

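Statistical inference in an exponential family naturally centers on the parameter $\theta$. As is well known, the
maximum likelihood estimator $\hat{\theta}$ takes a particularly simple form, obtainable using the fact (which follows
from equation (5)) that $\nabla_\theta z_A(\theta) = z_A(\theta) \mathbb{E}_\theta[T_A]$:

$$\begin{aligned}
\left. \nabla_\theta \frac{e^{\langle \theta, t_A(x) \rangle}}{z_A(\theta)} \right|_{\theta = \hat{\theta}} &= 0 \\
\frac{z_A(\hat{\theta}) \, t_A(x) \, e^{\langle \hat{\theta}, t_A(x) \rangle} - e^{\langle \hat{\theta}, t_A(x) \rangle} \, z_A(\hat{\theta}) \, \mathbb{E}_{\hat{\theta}}[T_A]}{z_A^2(\hat{\theta})} &= 0 \\
t_A(x) &= \mathbb{E}_{\hat{\theta}}[T_A] \qquad (9)
\end{aligned}$$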



In words, the most likely value of the parameter is the one where the expected value of the sufficient statistic 
equals the observed value. 
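The moment-matching condition can be solved numerically when the expectation is computable. The sketch below does this for the Ising chain of §3.1, with the expectation obtained by enumeration and the root found by bisection; the observed configuration `x_obs` and all function names are hypothetical, and bisection is our choice of solver rather than a method from the paper.

```python
import math
from itertools import product

def chain_stat(x):
    """T_{1:n} = sum of products of neighboring spins."""
    return sum(a * b for a, b in zip(x, x[1:]))

def expected_stat(n, theta):
    """E_theta[T_A] by brute-force enumeration over all 2^n spin configurations."""
    stats = [chain_stat(x) for x in product((-1, 1), repeat=n)]
    weights = [math.exp(theta * t) for t in stats]
    return sum(t * w for t, w in zip(stats, weights)) / sum(weights)

def mle(n, observed, lo=-10.0, hi=10.0, tol=1e-10):
    """Solve E_theta[T_A] = observed by bisection.

    E_theta[T_A] is increasing in theta, because its derivative is the
    variance of T_A, so the bracketed root is unique.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_stat(n, mid) < observed:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

x_obs = (1, 1, 1, -1, -1, 1, 1, 1)   # hypothetical observed chain, n = 8
t_obs = chain_stat(x_obs)
theta_hat = mle(len(x_obs), t_obs)
```

For this model the expectation has the closed form $(n - 1)\tanh\theta$, so the numerical answer can be compared with $\tanh^{-1}(t/(n-1))$. If the observed statistic sits at its extreme value $\pm(n - 1)$, no finite maximizer exists and the bisection simply runs to the edge of its bracket.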

Assume the conditions of Theorem 3.2 hold, so that the parameters are projective and the sufficient
statistics have (by Lemma 7.3) independent increments. Define the logarithm of the partition function
$a_A(\theta) = \log z_A(\theta)$.[6] Suppose that

$$a_A(\theta) = r_{|A|} \, a(\theta) \qquad (10)$$

where $|A|$ is some positive-valued measure of the size of $A$, $r_{|A|}$ a positive monotone-increasing function of
it, and $a : \Theta \to \mathbb{R}$ is differentiable (at least at $\theta$). Then, by equation (5) for the moment generating function,
the cumulant generating function of $T_A$ is

$$\kappa_{A,\theta}(\phi) = r_{|A|} \left( a(\theta + \phi) - a(\theta) \right) \qquad (11)$$

From the basic properties of cumulant generating functions, we have

$$\mathbb{E}_\theta[T_A] = \nabla_\phi \kappa_{A,\theta}(0) = r_{|A|} \nabla a(\theta) \qquad (12)$$

Substituting into equation (9),

$$\frac{t_A(x)}{r_{|A|}} = \nabla a(\hat{\theta}) \qquad (13)$$

Thus to control the convergence of $\hat{\theta}$, we must control the convergence of $T_A / r_{|A|}$.

Consider a growing sequence of sets $A$ such that $r_{|A|} \to \infty$. Since $T_A$ has independent increments, and
the cumulant generating functions for different $A$ are all proportional to each other, we may regard $T_A$ as a
time-transformation of a Lévy process $Y_r$ (Kallenberg, 2002). That is, there is a continuous-time stochastic
process $Y$ with IID increments, such that $Y_1$ has cumulant generating function $a(\theta + \phi) - a(\theta)$, and
$T_A = Y_{r_{|A|}}$. Note that $T_A$ itself does not have to have IID increments; rather, the distribution of the
increment $T_B - T_A$ must depend only on $r_{|B|} - r_{|A|}$. Specifically, from Lemma 7.6 and equation (10), the
cumulant generating function of the increment must be $(r_{|B|} - r_{|A|})[a(\theta + \phi) - a(\theta)]$. The scaling
factor homogenizes (so to speak) the increments of $T$.
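
As a worked illustration of the scaling relation and the increment formula, consider the simplest dyadic-independence case: a directed Bernoulli graph on $n$ nodes with $T_A$ the number of edges. This example is ours, added for concreteness; the symbols $N_A = n(n-1)$ (number of ordered pairs) and $p$ (edge probability) are introduced here as assumptions and do not appear in the text above.

```latex
% Illustrative assumption: directed Bernoulli graph, T_A = edge count,
% N_A = n(n-1) ordered pairs among the first n nodes, N_B likewise for B.
\begin{align*}
z_A(\theta) &= \left(1 + e^{\theta}\right)^{N_A},
  \qquad a_A(\theta) = N_A \log\left(1 + e^{\theta}\right), \\
r_{|A|} &= N_A,
  \qquad a(\theta) = \log\left(1 + e^{\theta}\right),
  \qquad \nabla a(\theta) = \frac{e^{\theta}}{1 + e^{\theta}} \equiv p, \\
(r_{|B|} - r_{|A|})\,[a(\theta + \phi) - a(\theta)]
  &= (N_B - N_A) \log\left(1 - p + p\, e^{\phi}\right).
\end{align*}
```

The last line is the cumulant generating function of a Binomial$(N_B - N_A, p)$ variable, as one would expect for the number of edges among the newly added dyads, and equation (13) reduces to matching the observed edge density to $p$.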

Writing the sufficient statistic as a transformed Lévy process yields a simple proof that $\hat{\theta}$ is strongly
(i.e., almost-surely) consistent. Since a Lévy process has IID increments, by the strong law of large numbers
$Y_{r_{|A|}} / r_{|A|}$ converges almost surely ($\mathbb{P}_\theta$) to $\mathbb{E}_\theta[Y_1]$ (Kallenberg, 2002). Since
$T_A = Y_{r_{|A|}}$, it follows that $T_A / r_{|A|} \to \mathbb{E}_\theta[Y_1]$ a.s. ($\mathbb{P}_\theta$) as well; but this
limit is $\nabla a(\theta)$. Thus the MLE converges on $\theta$ almost surely. We have thus proved



6 In statistical mechanics, $-a_A$ would be the Helmholtz free energy.






Theorem 4.1. Suppose that the model $\mathbb{P}_\theta$ is projective, and that the log partition function obeys
equation (10) for each $A \in \mathcal{A}$. Then the maximum likelihood estimator exists and is strongly consistent.
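
A small simulation can make the statement of Theorem 4.1 concrete. The sketch below is illustrative only: it assumes a directed Bernoulli (edge-count) graph, for which $r_{|A|} = n(n-1)$ and $a(\theta) = \log(1 + e^\theta)$, and grows a single graph node by node so that the estimates come from nested sub-networks. The function name and the chosen sizes are hypothetical.

```python
import math
import random

def mle_along_growing_graph(theta, sizes, rng):
    """Grow one directed Bernoulli graph node by node; return the MLE at each size."""
    p = 1.0 / (1.0 + math.exp(-theta))   # edge probability implied by theta
    n, edges, dyads = 1, 0, 0
    estimates = {}
    for target in sorted(sizes):
        while n < target:
            n += 1
            new_dyads = 2 * (n - 1)      # ordered pairs involving the new node
            edges += sum(rng.random() < p for _ in range(new_dyads))
            dyads += new_dyads           # running total equals n(n-1)
        # Solve grad a(theta_hat) = T_A / r_|A|; clamp to avoid log(0).
        density = min(max(edges / dyads, 1e-12), 1.0 - 1e-12)
        estimates[target] = math.log(density / (1.0 - density))
    return estimates

theta_true = -1.0
estimates = mle_along_growing_graph(theta_true, sizes=(10, 40, 160),
                                    rng=random.Random(0))
errors = {n: abs(est - theta_true) for n, est in estimates.items()}
```

The errors need not shrink monotonically along any one sample path, but at the largest size the estimate should be close to the true parameter, since the standard error scales like $1/\sqrt{n(n-1)}$.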

We may extend this in a number of ways. First, if the scaling relation equation (10) holds for a particular
$\theta$ (or set of $\theta$), then $T_A / r_{|A|}$ will converge almost surely for that $\theta$. Thus, strong consistency
of the MLE may in fact hold over certain parameter regions but not others. Second, when $d > 1$, all components
of $T_A$ must be scaled by the same factor $r_{|A|}$. Making the expectation value of one component of $T$ be
$O(|A|)$ while another was $O(|A|^3)$ (for instance) would violate equation (12) and so equation (10) as well.

Finally, while the exact scaling of equation (10), together with the independence of the increments, leads
to strong consistency of the MLE, ordinary consistency (convergence in probability) holds under weaker
conditions. Specifically, suppose that the log partition function or free energy scales in the limit as the size of
the assemblage grows,

$$
\lim_{r_{|A|} \to \infty} a_A(\theta) / r_{|A|} = a(\theta) \tag{14}
$$

(We give examples towards the end of §5 below.) We may then use the following theorem:

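Theorem 4.2. Suppose that an exponential family shows approximate scaling, i.e., equation (14) holds, for
some $\theta$. Then, for any measurable set $K \subseteq \mathbb{R}^d$,

$$
\liminf_{r_{|A|} \to \infty} \frac{1}{r_{|A|}} \log \mathbb{P}_{A,\theta}\left( \frac{T_A}{r_{|A|}} \in K \right) \geq - \inf_{t \in \mathrm{int}\, K} J(t) \tag{15}
$$

$$
\limsup_{r_{|A|} \to \infty} \frac{1}{r_{|A|}} \log \mathbb{P}_{A,\theta}\left( \frac{T_A}{r_{|A|}} \in K \right) \leq - \inf_{t \in \mathrm{cl}\, K} J(t) \tag{16}
$$

where

$$
J(t) = \sup_{\phi \in \mathbb{R}^d} \langle \phi, t \rangle - [a(\theta + \phi) - a(\theta)], \tag{17}
$$

and $\mathrm{int}\, K$ and $\mathrm{cl}\, K$ are respectively the interior and the closure of $K$.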

When the limits in equations (15) and (16) coincide, which they will for most nice sets $K$, we may say
that

$$
\frac{1}{r_{|A|}} \log \mathbb{P}_{A,\theta}\left( \frac{T_A}{r_{|A|}} \in K \right) \to - \inf_{t \in K} J(t) \tag{18}
$$

Since $J(t)$ is minimized at 0 when $t = \nabla a(\theta)$,$^{7}$ equation (18) holds in particular for any neighborhood
of $\nabla a(\theta)$, and for the complement of such neighborhoods, where the infimum of $J$ is strictly positive. Thus
$T_A / r_{|A|}$ converges in probability to $\nabla a(\theta)$, and $\hat{\theta} \xrightarrow{P} \theta$, for all $\theta$ where
equation (14) holds.
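
To see what the rate function in equation (17) looks like in a familiar case, the sketch below evaluates the supremum numerically for the scalar Bernoulli example, where $a(\theta) = \log(1 + e^\theta)$ (an assumption made for illustration; it corresponds to an edge-count model). In that case the Legendre transform should coincide with the Kullback-Leibler divergence between Bernoulli($t$) and Bernoulli($p$), with $p = e^\theta/(1+e^\theta)$. The function names and the ternary-search bounds are our own choices.

```python
import math

def a(theta):
    """Scaled log partition function of the Bernoulli example: log(1 + e^theta)."""
    return math.log1p(math.exp(theta))

def rate_function(t, theta, lo=-30.0, hi=30.0, iters=200):
    """J(t) = sup_phi { phi * t - [a(theta + phi) - a(theta)] }, by ternary search.

    The objective is concave in phi, so ternary search on a bracket suffices.
    """
    def objective(phi):
        return phi * t - (a(theta + phi) - a(theta))
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    return objective(0.5 * (lo + hi))

def bernoulli_kl(t, p):
    """KL divergence between Bernoulli(t) and Bernoulli(p), for 0 < t, p < 1."""
    return t * math.log(t / p) + (1.0 - t) * math.log((1.0 - t) / (1.0 - p))

theta = -1.0
p = 1.0 / (1.0 + math.exp(-theta))      # p = grad a(theta)
j_half = rate_function(0.5, theta)      # positive: 0.5 is away from p
j_at_mean = rate_function(p, theta)     # zero at t = grad a(theta)
```

The rate function vanishes exactly at $t = \nabla a(\theta)$ and is strictly positive elsewhere, which is the property used above to obtain convergence in probability.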

Heuristically, when equation (14) holds but equation (10) fails, we may imagine approximating the actual
collection of dependent and heterogeneous random variables with an average of IID, homogenized effective
variables, altering the behavior of the global sufficient statistic $T$ by no more than $o_P(r_{|A|})$. In
statistical-mechanical terms, this means using renormalization (Yeomans, 1992). Probabilistically, the existence of
a limiting (scaled) cumulant generating function is a weak dependence condition (den Hollander, 2000,
§V.3.2). While under equation (10) we identified the $T_A$ process with a time-transformed Lévy process, now
we can only use a central limit theorem to say they are close (den Hollander, 2000, §V.3.1), reducing
almost-sure to stochastic convergence. (See Jona-Lasinio (2001) on the relation between central limit theorems and
renormalization.) In any event, asymptotic scaling of the log partition function implies $\hat{\theta}$ is consistent.



7 For small $\epsilon \in \mathbb{R}^d$, by a second-order Taylor expansion, $J(\epsilon + \nabla a(\theta)) \approx \frac{1}{2} \langle \epsilon, I(\theta) \epsilon \rangle$, where $I(\theta)$ acts as the Fisher information rate;
cf. Bahadur (1971).

5 Application: Non-projectibility of Exponential Random Graph Models



As mentioned in the introduction, our general results about projective structure in exponential families arose 
from questions about exponential random graph models of networks. To make the application clear, we must 
fill in some details regarding ERGMs. 

Given a group of $n$ nodes, the network among them is represented by the binary $n \times n$ adjacency matrix
$X$, where $X_{ij} = 1$ if there is a tie from $i$ to $j$ and is 0 otherwise. (Undirected graphs impose $X_{ij} = X_{ji}$.)
We may also have covariates for each node, say $Y_i$. Our projective structure will in fact be that of looking
at the sub-graphs among larger and larger groups of nodes. That is, $A$ is the sub-network among the first $n$
nodes, and $B \supset A$ is the sub-network among the first $n + m$ nodes. The graph or adjacency matrix itself is
the stochastic process which is to have an exponential family distribution, conditional on the covariates:

$$
p_\theta(x \mid y) = \frac{e^{\langle \theta, t(x, y) \rangle}}{z(\theta \mid y)} \tag{19}
$$

(We are only interested in the exponential-family distribution of the graph holding the covariates fixed.) As
mentioned above, the components of $T$ typically count the number of occurrences of various sub-graphs or
motifs (such as edges, triangles, larger cliques, "$k$-stars" ($k$ nodes connected through a central node), etc.),
perhaps interacted with values of the nodal covariates. The definition of $T$ may include normalizing the
counts of these "motifs" by data-independent combinatorial factors to yield densities.

A dyad consists of an unordered pair of individuals. In a dyadic independence model, each dyad's
configuration is independent of every other dyad's (conditional on $Y$). In an ERGM, dyadic independence is
equivalent to the (vector-valued) statistic $T$ adding up over dyads,

$$
t(X, Y) = \sum_{i=1}^{n} \sum_{j < i} t_{ij}(X_{ij}, X_{ji}, Y_i, Y_j) \tag{20}
$$

That is, the statistic can be written as a sum of terms over the information available for each dyad. In
particular, in block models (Bickel and Chen, 2009), $Y_i$ is categorical, giving the type of node $i$, and the
vector of sufficient statistics counts dyad configurations among pairs of nodes of given pairs of types. Dyadic
independence implies projectibility: since all dyads have independent configurations, each dyad makes a
separate additive contribution to $T$. Going from $n - 1$ to $n$ nodes thus adds $n - 1$ terms, unconstrained by the
configuration among the $n - 1$ nodes. $T$ thus has separable increments, implying projectibility by Theorem
3.2. (Adding a new node adds only edges between the old nodes and the new, without disturbing the old
counts.)$^{8}$ As the distribution factorizes into a product of $n(n - 1)$ terms, each of exactly the same form, the
log partition function scales exactly with $n(n - 1)$, and the conclusions of §4 imply the strong consistency
of the maximum likelihood estimator.$^{9}$ This result thus applies to the well-studied $\beta$-model (Barvinok and
Hartigan, 2010; Chatterjee et al., 2011; Rinaldo et al., 2011).
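
The projectibility of a dyadic independence model can be checked by brute force on very small graphs. The sketch below is an illustration under stated assumptions, not the paper's procedure: it takes undirected graphs, a single edge-count statistic, and a scalar parameter, enumerates every graph on 4 nodes, marginalizes out the dyads involving the last node, and compares the result with the same model specified directly on 3 nodes. All function names are hypothetical.

```python
import itertools
import math

def dyads(n):
    """All unordered pairs (i, j) with i < j among nodes 0..n-1."""
    return list(itertools.combinations(range(n), 2))

def ergm(n, theta, stat):
    """Return {graph: probability} with p(g) proportional to exp(theta * stat(g)).

    A graph is represented as a frozenset of edges (i, j), i < j.
    """
    pairs = dyads(n)
    graphs = [frozenset(p for p, bit in zip(pairs, bits) if bit)
              for bits in itertools.product([0, 1], repeat=len(pairs))]
    weights = {g: math.exp(theta * stat(g)) for g in graphs}
    z = sum(weights.values())
    return {g: w / z for g, w in weights.items()}

def marginalize(dist, n):
    """Marginal law of the induced sub-graph on nodes 0..n-1."""
    out = {}
    for g, p in dist.items():
        sub = frozenset(e for e in g if e[1] < n)   # keep edges among old nodes
        out[sub] = out.get(sub, 0.0) + p
    return out

def edge_count(g):
    return len(g)

theta = -0.7
small = ergm(3, theta, edge_count)
large_marginal = marginalize(ergm(4, theta, edge_count), 3)
max_gap = max(abs(small[g] - large_marginal[g]) for g in small)
```

For the edge-count model the two distributions agree up to floating-point error, which is what projectibility with a common $\theta$ requires.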

Typically, however, ERGMs are not dyadic independence models. In many networks, if nodes $i$ and $j$ are
both linked to $k$, then $i$ and $j$ are unusually likely to be directly linked. This will of course happen if nodes
of the same type are especially likely to be friends ("homophily", McPherson et al. 2001), since then the
posterior probability of $i$ and $j$ being of the same type is elevated. However, it can also be modeled directly.
The direct way to do so is to introduce the number (or density) of triangles as a sufficient statistic, but this
leads to pathological degeneracy (Rinaldo et al., 2009), and modern specifications involve a large set of
triangle-like motifs (Snijders et al., 2006; Wasserman and Robins, 2005; Handcock et al., 2008). Empirically,
when using such specifications, one often finds a non-trivial coefficient for such "transitivity" or "clustering",
over and above homophily (Goodreau et al., 2009). It is because of such findings that we ask whether the
parameters in these models are projective.

8 We have assumed the type of each node is available as a covariate. In the stochastic block model, types are latent, and the marginal
distribution of graphs sums over type-conditional distributions. Proposition B.1 in Appendix B shows that such summing-over-latents
preserves projectibility. For stochastic block models, projectibility also follows from Lovász and Szegedy (2006, Theorem 2.7(ii)).

9 An important variant of such models is the "degree-corrected block model" of Karrer and Newman (2011), where each node has
a unique parameter, which is its expected degree. It is easily seen that the range of possible degrees for each new node is the same, no
matter what the configuration of smaller sub-graphs (in which the node does not appear), as is the number of configurations giving rise
to each degree. The conditions of §3.2 thus hold, and these models are projective.

Sadly, no statistic which counts triangles, or larger motifs, can have the nice additive form of dyad
counts, no matter how we decompose the network. Take, for instance, triangles. Any given edge among
the first $n$ nodes could be part of a triangle, depending on ties to the next node. Thus to determine the
number of triangles among the first $n + 1$ nodes, we need much more information about the sub-graph of
the first $n$ nodes than just the number of triangles among them. Indeed, we can go further. The range of
possible increments to the number of triangles changes with the number of existing triangles. This is quite
incompatible with separable increments, so, by Theorem 3.2, the parameters cannot be projective. We remark that
the non-projectibility of Markov graphs (Frank and Strauss, 1986), a special instance of ERGMs where the
sufficient statistics count edges, $k$-stars and triangles, was noted in Lauritzen (2008).
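
The failure of separable increments for triangle counts can be seen on the smallest possible case. The sketch below, which is only an illustration with hypothetical helper names, tabulates the conditional volume factor (how many ways of attaching a fourth node produce each increment) starting from the empty graph and from the complete graph on three nodes, for both the triangle count and the edge count.

```python
import itertools
from collections import Counter

def dyads(n):
    """All unordered pairs (i, j) with i < j among nodes 0..n-1."""
    return list(itertools.combinations(range(n), 2))

def triangles(g, n):
    """Number of triangles in graph g (a frozenset of edges) on nodes 0..n-1."""
    return sum(1 for a, b, c in itertools.combinations(range(n), 3)
               if (a, b) in g and (a, c) in g and (b, c) in g)

def edges(g, n):
    return len(g)

def increment_counts(base, n, stat):
    """Counter {increment: number of extensions of `base` from n to n+1 nodes}."""
    new_pairs = [(i, n) for i in range(n)]       # dyads involving the new node
    counts = Counter()
    for bits in itertools.product([0, 1], repeat=n):
        ext = base | frozenset(p for p, bit in zip(new_pairs, bits) if bit)
        counts[stat(ext, n + 1) - stat(base, n)] += 1
    return counts

empty = frozenset()
full = frozenset(dyads(3))                       # a single triangle

tri_empty = increment_counts(empty, 3, triangles)
tri_full = increment_counts(full, 3, triangles)
edge_empty = increment_counts(empty, 3, edges)
edge_full = increment_counts(full, 3, edges)
```

Starting from the empty graph, no attachment of the new node can create a triangle, whereas starting from the complete graph increments of 1 and 3 become possible; the edge-count increments, by contrast, have the same distribution of counts from either starting point.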

Parallel arguments apply to the count of any motif of $k$ nodes, $k > 2$. Any given edge (or absence of
an edge) among the first $n$ nodes could be part of such a motif, depending on the edges involving the next
$k - 2$ nodes. Such counts are thus not nicely additive. For the same reasons as with triangles, the range of
increments for such statistics is not constant, and non-separable increments imply a non-projective family.

While these ERGMs are not projective, some of them may, as a sort of consolation prize, still satisfy
equation (14). For instance, in models where $T$ has two elements, the number of edges and the (normalized)
number of triangles or of 2-stars, the log partition function is known to scale like $n(n - 1)$ as the number of
nodes $n \to \infty$, at least in the parameter regimes where the models behave basically like either very full or
very empty Erdős-Rényi networks (Park and Newman, 2004a,b, 2006; Chatterjee and Dey, 2010; Chatterjee
and Diaconis, 2011; Bhamidi et al., 2011). (We suspect, from Park and Newman 2004b; Xiang and Neville
2011; Chatterjee and Diaconis 2011, that similar results apply to many other ERGMs.) Thus, by equation
(18), if we fix a large number $n$ of nodes and generate a graph $X$ from $\mathbb{P}_{\theta,n}$, the probability that the MLE
$\hat{\theta}(X)$ will be more than $\epsilon$ away from $\theta$ will be exponentially small in $n(n - 1)$ and $\epsilon^2$. Since these models
are not projective, however, it is impossible to improve parameter estimates by getting more data, since
parameters for smaller sub-graphs just cannot be extrapolated to larger graphs (or vice versa).

We thus have a near-dichotomy for ERGMs. Dyadic independence models have separable and independent
increments to the statistics, and the resulting family is projective. However, specifications where the
sufficient statistics count larger motifs cannot have separable increments, and projectibility does not hold.
Such an ERGM may provide a good description of a given social network on a certain set of nodes, but
it cannot be projected to give predictions on any larger or more global graph from which that one was
drawn. If an ERGM is postulated for the whole network, then inference for its parameters must explicitly
treat the unobserved portions of the network as missing data (perhaps through an expectation-maximization
algorithm), though of course there may be considerable uncertainty about just how much data is missing.

6 Conclusion 

Specifications for exponential families of dependent variables in terms of joint distributions are surprisingly
delicate; the statistics must be chosen extremely carefully, in order to achieve separable increments.
(Conditional specifications do not have this problem.) This has, perhaps, been obscured in the past by the emphasis
on using exponential families to model multivariate but independent cases, as IID models are always
projective.

Network models, one of the outstanding applications of exponential families, suffer from this problem
in an acute form. Dyadic independence models are projective models, but are sociologically extremely
implausible, and certainly do not manage to reproduce the data well. More interesting specifications, involving
clustering terms, never have separable increments. We thus have an impasse which it seems can only be
resolved by going to a different family of specifications. One possibility, which however requires more and
different data, is to model the evolution of networks over time (Snijders, 2005). In particular, Hanneke
et al. (2010) consider situations where the distribution of the network at time $t + 1$ conditional on the
network at time $t$ follows an exponential family. Even when the statistics in the conditional specification
include (say) changes in the number of triangles, the issues raised above do not apply.

Roughly speaking, the issue with the non-projective ERGM specifications, and with other non-projective
exponential families, is that the dependency structure corresponding to the statistics allows interactions
between arbitrary collections of random variables. It is not possible, with those statistics, to "screen off" one
part of the assemblage from another by conditioning on boundary terms. Suppose our larger information
set $B$ consists of two non-overlapping and strictly smaller information sets, $A \subset B$ and $C \subset B$, plus the
new observation obtained by looking at both $A$ and $C$. (For instance, the latter might be the edges between
two disjoint sets of nodes.) Then the models which work properly are ones where the sufficient statistic
for $B$ partitions into marginal terms from $A$ and $C$, plus the interactions strictly between them:
$T_B(X_B) = T_A(X_A) + T_C(X_C) + T_{B \setminus (A \cup C)}(X_{B \setminus (A \cup C)})$. In physical language (Landau and Lifshitz, 1980), the energy for
the whole assemblage needs to be a sum of two "volume" terms for its sub-assemblages, plus a "surface" term
for their interface. The network models with non-projective parameters do not admit such a decomposition;
every variable, potentially, interacts with every other variable.

One might try to give up the exponential family form, while keeping finite-dimensional sufficient
statistics. We suspect that this will not work, however, since Lauritzen (1988) showed that whenever the sufficient
statistics form a semi-group, the models must be either ordinary exponential families, or certain
generalizations thereof with much the same properties. We believe that there exists a purely algebraic characterization
of the sufficient statistics compatible with projectibility, but must leave this for the future.

One reason for the trouble with ERGMs is that every infinite exchangeable graph distribution is actually
a mixture over projective dyadic-independence distributions (Diaconis and Janson, 2008; Bickel and
Chen, 2009), though not necessarily ones with a finite-dimensional sufficient statistic. Along any one
sequence of sub-graphs from such an infinite graph, in fact, the densities of all motifs approach limiting values
which pick out a unique projective dyadic-independence distribution (Diaconis and Janson, 2008; cf.
Lauritzen 1988, 2008). This suggests that an alternative to parametric inference would be non-parametric
estimation of the limiting dyadic-independence model, by smoothing the adjacency matrix; this, too, we
pursue elsewhere.

7 Proofs

For notation in this section, without loss of generality, fix a generic pair of subsets $A \subset B$ and a value of $\theta$.
We will write a representative point $x_B \in \mathcal{X}_B$ as $x_B = (x, y)$, with $x \in \mathcal{X}_A$ and $y \in \mathcal{X}_{B \setminus A}$. Also, we
abbreviate $t_B(x, y) - t_A(x)$, for $x \in \mathcal{X}_A$ and $y \in \mathcal{X}_{B \setminus A}$, by $t_{B \setminus A}(x, y)$.

7.1 Proof of Theorem 3.2

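For clarity, we prove the two directions separately. First we show that projectibility implies separable
increments.

Proposition 7.1. If the exponential family $\{\mathbb{P}_{A,\theta}\}_{A \in \mathcal{A}}$ is projective, then the sufficient statistics
$\{T_A\}_{A \in \mathcal{A}}$ have separable increments, i.e. $A \subset B$ implies that $v_{B \setminus A | A}(\delta, x) = v_{B \setminus A}(\delta)$.

Proof. By projectibility, for each $\theta$,

\begin{align}
p_{A,\theta}(x) &= \sum_{y \in \mathcal{X}_{B \setminus A}} p_{B,\theta}(x, y) = \sum_{y \in \mathcal{X}_{B \setminus A}} \frac{e^{\langle \theta, t_B(x, y) \rangle}}{z_B(\theta)} \tag{21} \\
&= \frac{1}{z_B(\theta)} \sum_{y \in \mathcal{X}_{B \setminus A}} \exp\{ \langle \theta, t_B(x, y) - t_A(x) \rangle + \langle \theta, t_A(x) \rangle \} \tag{22} \\
&= \frac{e^{\langle \theta, t_A(x) \rangle}}{z_B(\theta)} \sum_{y \in \mathcal{X}_{B \setminus A}} \exp\{ \langle \theta, t_{B \setminus A}(x, y) \rangle \} \tag{23} \\
&= p_{A,\theta}(x) \frac{z_A(\theta)}{z_B(\theta)} \sum_{y \in \mathcal{X}_{B \setminus A}} \exp\{ \langle \theta, t_{B \setminus A}(x, y) \rangle \}, \tag{24}
\end{align}

which implies that, for all $x \in \mathcal{X}_A$,

$$
\sum_{y \in \mathcal{X}_{B \setminus A}} \exp\{ \langle \theta, t_{B \setminus A}(x, y) \rangle \} = \frac{z_B(\theta)}{z_A(\theta)}. \tag{25}
$$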

Re-writing the left-hand side of equation (25) as a sum over the set $\Delta(x)$ of values which the increment
$t_{B \setminus A}(x, y)$ to the sufficient statistic might take yields

$$
\sum_{\delta \in \Delta(x)} v_{B \setminus A | A}(\delta, x) \exp \langle \theta, \delta \rangle = \frac{z_B(\theta)}{z_A(\theta)} \tag{26}
$$

where the joint volume factor is defined in (6). Since the right-hand side of equation (26) is the same for all
$x$, so must the left-hand side.

Observe that this left-hand side is the Laplace transform of the function $v_{B \setminus A | A}(\cdot, x)$. The latter is a
non-negative function which defines a measure on $\mathbb{R}^d$, whose support is $\Delta(x)$. Hence,

$$
\sum_{\delta \in \Delta(x)} \frac{v_{B \setminus A | A}(\delta, x)}{\sum_{\delta' \in \Delta(x)} v_{B \setminus A | A}(\delta', x)} \exp \langle \theta, \delta \rangle \tag{27}
$$

is the Laplace transform of a discrete probability measure in $\mathbb{R}^d$. But the denominator in the inner sum is
just $|\mathcal{X}_{B \setminus A}|$, no matter what $x$ might be.$^{10}$ So we have that for any $x, x' \in \mathcal{X}_A$, and all $\theta \in \Theta$,

$$
\sum_{\delta \in \Delta(x)} \frac{v_{B \setminus A | A}(\delta, x)}{|\mathcal{X}_{B \setminus A}|} \exp \langle \theta, \delta \rangle = \sum_{\delta \in \Delta(x')} \frac{v_{B \setminus A | A}(\delta, x')}{|\mathcal{X}_{B \setminus A}|} \exp \langle \theta, \delta \rangle \tag{28}
$$

Since both sides of equation (28) are Laplace transforms of probability measures on a common space, and
the equality holds on all of $\Theta$, which contains an open set, we may conclude that the two measures are equal
(Barndorff-Nielsen, 1978, Theorem 7.3). This means that they have the same support, $\Delta(x) = \Delta(x') = \Delta$,
and that they have the same density with respect to counting measure on $\Delta$. As they also have the same
normalizing factor (viz., $|\mathcal{X}_{B \setminus A}|$), we get that $v_{B \setminus A | A}(\delta, x) = v_{B \setminus A | A}(\delta, x') = v_{B \setminus A}(\delta)$. Since the points $x$ and
$x'$ are arbitrary, this last property is precisely having separable increments. ■

Next, we prove the reverse direction, namely that separable increments imply projectibility. This is clearer 
with some preliminary lemmas. 



10 This can be seen either from recalling that exponential families have full support, or from defining $T_B$ as a total and not a partial
function on $\mathcal{X}_B$.






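Lemma 7.2. If the sufficient statistics have separable increments, then the joint volume factors factorize, i.e.,

$$
v_{A, B \setminus A}(t, \delta) = v_A(t)\, v_{B \setminus A}(\delta), \tag{29}
$$

for all $A \subset B$, $t$ and $\delta$.

Proof. By definition,

$$
v_{A, B \setminus A}(t, \delta) = \sum_{\{x \in \mathcal{X}_A \,:\, t_A(x) = t\}} v_{B \setminus A | A}(\delta, x). \tag{30}
$$

When the statistic has separable increments, $v_{B \setminus A | A}(\delta, x) = v_{B \setminus A}(\delta)$, so

$$
v_{A, B \setminus A}(t, \delta) = \sum_{\{x \,:\, t_A(x) = t\}} v_{B \setminus A}(\delta) = v_A(t)\, v_{B \setminus A}(\delta), \tag{31}
$$

proving the claim. ■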

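Lemma 7.3. If the joint volume factor factorizes, then the sufficient statistics have independent increments, and
the distribution of the sufficient statistic is projective.

Proof. Without loss of generality, fix a value $t$ for $T_A$ and $\delta$ for $T_{B \setminus A}$. By the law of total probability and the
definition of the volume factor,

$$
\mathbb{P}_{B,\theta}(T_A = t, T_{B \setminus A} = \delta) = v_{A, B \setminus A}(t, \delta) \frac{e^{\langle \theta, t \rangle} e^{\langle \theta, \delta \rangle}}{z_B(\theta)}. \tag{32}
$$

If the volume factor factorizes, so that $v_{A, B \setminus A}(t, \delta) = v_A(t)\, v_{B \setminus A}(\delta)$, then we obtain

$$
\mathbb{P}_{B,\theta}(T_A = t, T_{B \setminus A} = \delta) = \frac{v_A(t)\, e^{\langle \theta, t \rangle}}{z_A(\theta)} \cdot \frac{z_A(\theta)}{z_B(\theta)}\, v_{B \setminus A}(\delta)\, e^{\langle \theta, \delta \rangle} \tag{33}
$$

It then follows that

$$
\mathbb{P}_{B,\theta}(T_A = t, T_{B \setminus A} = \delta) = \mathbb{P}_{B,\theta}(T_A = t)\, \mathbb{P}_{B,\theta}(T_{B \setminus A} = \delta), \quad \forall \theta, \tag{34}
$$

and thus that $T$ has independent increments. To establish the projectibility of the distribution of $T$, sum over
$\delta$:

\begin{align*}
\mathbb{P}_{B,\theta}(T_A = t) &= \sum_{\delta} \mathbb{P}_{B,\theta}(T_A = t, T_{B \setminus A} = \delta) \\
&= \frac{v_A(t)\, e^{\langle \theta, t \rangle}}{z_B(\theta)} \sum_{\delta} v_{B \setminus A}(\delta)\, e^{\langle \theta, \delta \rangle} \\
&\propto v_A(t)\, e^{\langle \theta, t \rangle}
\end{align*}

Since $\mathbb{P}_{A,\theta}(T_A = t) = v_A(t)\, e^{\langle \theta, t \rangle} / z_A(\theta)$, and both distributions must sum to 1 over $t$, we can conclude that
$z_A(\theta) = z_B(\theta) / z_{B \setminus A}(\theta)$, where $z_{B \setminus A}(\theta) \equiv \sum_{\delta} v_{B \setminus A}(\delta)\, e^{\langle \theta, \delta \rangle}$, and hence that the distribution of the
sufficient statistic is projective. ■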
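
The independence of increments asserted by Lemma 7.3 can be checked by enumeration in a tiny example. The sketch below is an illustration under our own assumptions (undirected graphs, $A$ the first 3 nodes, $B$ the first 4, scalar $\theta$, hypothetical function names): it computes the joint law of $(T_A, T_{B \setminus A})$ and measures how far it is from the product of its marginals, first for the edge count and then for the triangle count.

```python
import itertools
import math
from collections import defaultdict

def dyads(n):
    """All unordered pairs (i, j) with i < j among nodes 0..n-1."""
    return list(itertools.combinations(range(n), 2))

def edge_count(g):
    return len(g)

def triangle_count(g):
    """Number of triangles in g, a frozenset of edges (i, j) with i < j."""
    nodes = sorted({v for e in g for v in e})
    return sum(1 for a, b, c in itertools.combinations(nodes, 3)
               if {(a, b), (a, c), (b, c)} <= g)

def joint_increment_law(n, theta, stat):
    """Law of (T_A, T_B - T_A): B has n+1 nodes, A is the first n nodes."""
    pairs = dyads(n + 1)
    joint = defaultdict(float)
    for bits in itertools.product([0, 1], repeat=len(pairs)):
        g = frozenset(p for p, bit in zip(pairs, bits) if bit)
        sub = frozenset(e for e in g if e[1] < n)     # induced graph on A
        t_a, t_b = stat(sub), stat(g)
        joint[(t_a, t_b - t_a)] += math.exp(theta * t_b)
    z = sum(joint.values())
    return {k: w / z for k, w in joint.items()}

def dependence_gap(joint):
    """max |P(t, d) - P(t) P(d)| over the product of the marginal supports."""
    p_t, p_d = defaultdict(float), defaultdict(float)
    for (t, d), p in joint.items():
        p_t[t] += p
        p_d[d] += p
    return max(abs(joint.get((t, d), 0.0) - p_t[t] * p_d[d])
               for t in p_t for d in p_d)

edge_gap = dependence_gap(joint_increment_law(3, -0.7, edge_count))
triangle_gap = dependence_gap(joint_increment_law(3, 0.5, triangle_count))
```

For the edge count the gap is zero up to rounding, as the lemma predicts when the volume factor factorizes. For the triangle count the gap is clearly positive: for example, an increment of 2 can only occur when the first three nodes do not already form a triangle.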

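Lemma 7.4. If the sufficient statistics of an exponential family have separable increments, then

$$
\mathbb{P}_{B,\theta}(X_A = x, T_{B \setminus A} = \delta) = \frac{1}{v_A(t_A(x))} \mathbb{P}_{B,\theta}(T_A = t_A(x), T_{B \setminus A} = \delta) \tag{35}
$$

Proof. Abbreviate $t_A(x)$ by $t$. By the law of total probability,

$$
\mathbb{P}_{B,\theta}(T_A = t, T_{B \setminus A} = \delta) = \sum_{(x, y) \,:\, t_A(x) = t,\; t_{B \setminus A}(x, y) = \delta} p_{B,\theta}(x, y) \tag{36}
$$

Since $T_B$ is sufficient, and $t_B(x, y) = t + \delta$ for all $(x, y)$ in the sum,

$$
\mathbb{P}_{B,\theta}(T_A = t, T_{B \setminus A} = \delta) = v_{A, B \setminus A}(t, \delta)\, e^{\langle \theta, t + \delta \rangle} / z_B(\theta) \tag{37}
$$

By parallel reasoning,

$$
\mathbb{P}_{B,\theta}(X_A = x, T_{B \setminus A} = \delta) = v_{B \setminus A | A}(\delta, x)\, e^{\langle \theta, t + \delta \rangle} / z_B(\theta) \tag{38}
$$

Therefore

$$
\mathbb{P}_{B,\theta}(X_A = x, T_{B \setminus A} = \delta) = v_{B \setminus A | A}(\delta, x) \frac{\mathbb{P}_{B,\theta}(T_A = t, T_{B \setminus A} = \delta)}{v_{A, B \setminus A}(t, \delta)} \tag{39}
$$

If the statistic has separable increments, then $v_{A, B \setminus A}(t, \delta) = v_A(t)\, v_{B \setminus A}(\delta) = v_A(t)\, v_{B \setminus A | A}(\delta, x)$, and the
conclusion follows. ■

Remark: The lemma does not follow merely from the joint volume factor separating, $v_{A, B \setminus A}(t, \delta) = v_A(t)\, v_{B \setminus A}(\delta)$.
The conditional volume factor must also be constant in $x$.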

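Proposition 7.5. If the sufficient statistic of an exponential family has separable increments, then the family is
projective.

Proof. We calculate the marginal probability of $X_A$ in $\mathbb{P}_{B,\theta}$, by integrating out the increment to the sufficient
statistic. (The set of possible increments, $\Delta$, is the same for all $x$, by separability.) Once again, we abbreviate
$t_A(x)$ by $t$.

\begin{align*}
\mathbb{P}_{B,\theta}(X_A = x) &= \sum_{\delta \in \Delta} \mathbb{P}_{B,\theta}(X_A = x, T_{B \setminus A} = \delta) \\
&= \frac{1}{v_A(t)} \sum_{\delta \in \Delta} \mathbb{P}_{B,\theta}(T_A = t, T_{B \setminus A} = \delta) \\
&= \frac{1}{v_A(t)} \sum_{\delta \in \Delta} \mathbb{P}_{B,\theta}(T_A = t)\, \mathbb{P}_{B,\theta}(T_{B \setminus A} = \delta \mid T_A = t) \\
&= \frac{\mathbb{P}_{B,\theta}(T_A = t)}{v_A(t)} \\
&= \frac{\mathbb{P}_{A,\theta}(T_A = t)}{v_A(t)} \\
&= \mathbb{P}_{A,\theta}(X_A = x)
\end{align*}

These steps use, in succession: Lemma 7.4; the fact that conditional probabilities sum to 1; the projectibility
of the sufficient statistics (via Lemmas 7.2 and 7.3); and the definition of $v_A(t)$. ■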

7.2 Other Proofs 

Proof of Proposition 3.3. By Proposition 7.1, a projective family has separable increments, and by Lemmas
7.2 and 7.3, separable increments imply independent increments. ■

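Proof of Proposition 3.4. By Proposition 7.1, every projective exponential family has separable increments.
By Lemma 7.4, in an exponential family with separable increments,

$$
\mathbb{P}_{B,\theta}(X_A = x, T_{B \setminus A} = \delta) = \frac{1}{v_A(t_A(x))} \mathbb{P}_{B,\theta}(T_A = t_A(x), T_{B \setminus A} = \delta) \tag{40}
$$

Therefore, using projectibility,

$$
\mathbb{P}_{B,\theta}(T_{B \setminus A} = \delta \mid X_A = x) = \frac{\mathbb{P}_{B,\theta}(T_A = t_A(x), T_{B \setminus A} = \delta) / v_A(t_A(x))}{p_{A,\theta}(x)} \tag{41}
$$

By the definition of $v_A(\cdot)$, $p_{A,\theta}(x) = \mathbb{P}_{A,\theta}(T_A = t_A(x)) / v_A(t_A(x))$, so

$$
\mathbb{P}_{B,\theta}(T_{B \setminus A} = \delta \mid X_A = x) = \frac{\mathbb{P}_{B,\theta}(T_A = t_A(x), T_{B \setminus A} = \delta)}{\mathbb{P}_{A,\theta}(T_A = t_A(x))} \tag{42}
$$

But, by Lemma 7.3, the sufficient statistics have a projective distribution with independent increments,
implying

$$
\mathbb{P}_{B,\theta}(T_A = t_A(x), T_{B \setminus A} = \delta) = \mathbb{P}_{A,\theta}(T_A = t_A(x))\, \mathbb{P}_{B,\theta}(T_{B \setminus A} = \delta) \tag{43}
$$

Therefore,

$$
\mathbb{P}_{B,\theta}(T_{B \setminus A} = \delta \mid X_A = x) = \mathbb{P}_{B,\theta}(T_{B \setminus A} = \delta) \tag{44}
$$

and so $T_{B \setminus A} \perp\!\!\!\perp X_A$. ■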

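Proof of Proposition 3.5. Below we prove that if the sufficient statistics of an exponential family have
independent increments, then the volume factor separates, and the distribution of the statistic is projective.
Since $T_B$ is a sufficient statistic, by the Neyman factorization theorem (Schervish, 1995, Thm. 2.21, p. 89)

$$
\mathbb{P}_{B,\theta}(X_A = x, X_{B \setminus A} = y) = g_B(\theta, t_A(x) + t_{B \setminus A}(x, y))\, h(x, y) \tag{45}
$$

In light of equation (2), we may take $h(x, y) = 1$. Abbreviating $t_A(x)$ by $t$ and $t_{B \setminus A}(x, y)$ by $\delta$, it follows that

$$
\mathbb{P}_{B,\theta}(T_A = t, T_{B \setminus A} = \delta) = v_{A, B \setminus A}(t, \delta)\, g_B(\theta, t + \delta) \tag{46}
$$

By independent increments, however,

$$
\mathbb{P}_{B,\theta}(T_A = t, T_{B \setminus A} = \delta) = \mathbb{P}_{B,\theta}(T_A = t)\, \mathbb{P}_{B,\theta}(T_{B \setminus A} = \delta) \tag{47}
$$

whence it follows that, for some functions $g_{B \setminus A}$, $k_A$, $k_{B \setminus A}$,

$$
g_B(\theta, t + \delta) = g_A(\theta, t)\, g_{B \setminus A}(\theta, \delta) \tag{48}
$$

and

$$
v_{A, B \setminus A}(t, \delta) = k_A(t)\, k_{B \setminus A}(\delta) \tag{49}
$$

and

$$
\mathbb{P}_{B,\theta}(T_A = t, T_{B \setminus A} = \delta) = k_A(t)\, k_{B \setminus A}(\delta)\, g_A(\theta, t)\, g_{B \setminus A}(\theta, \delta) \tag{50}
$$

To proceed, we must identify the new $g$ and $k$ functions. To this end, recalling that $v_A(t)$ is the number of
$x_A$ configurations such that $t_A(x_A) = t$, we have

$$
\sum_{\delta} v_{A, B \setminus A}(t, \delta) = v_A(t)\, |\mathcal{X}_{B \setminus A}| \tag{51}
$$

and, at the same time,

$$
\sum_{\delta} v_{A, B \setminus A}(t, \delta) = k_A(t) \sum_{\delta} k_{B \setminus A}(\delta). \tag{52}
$$

Clearly then, $k_A(t) = c_1 v_A(t)$ while $\sum_{\delta} k_{B \setminus A}(\delta) = c_2 |\mathcal{X}_{B \setminus A}|$. Since

$$
\sum_{t} \sum_{\delta} v_{A, B \setminus A}(t, \delta) = |\mathcal{X}_A|\, |\mathcal{X}_{B \setminus A}| \tag{53}
$$

and $\sum_{t} v_A(t) = |\mathcal{X}_A|$, we need $c_1 c_2 = 1$, and may take $c_1 = c_2 = 1$ for simplicity. This allows us to write

$$
v_{A, B \setminus A}(t, \delta) = v_A(t)\, v_{B \setminus A}(\delta) \tag{54}
$$

which is exactly the assertion that the volume factor separates.

Turning to the $g$ functions, we sum over $\delta$ again to obtain the marginal distribution of $T_A$:

\begin{align*}
\mathbb{P}_{B,\theta}(T_A = t) &= \sum_{\delta} \mathbb{P}_{B,\theta}(T_A = t, T_{B \setminus A} = \delta) \\
&= \sum_{\delta} v_A(t)\, g_A(\theta, t)\, v_{B \setminus A}(\delta)\, g_{B \setminus A}(\theta, \delta) \\
&= v_A(t)\, g_A(\theta, t) \sum_{\delta} v_{B \setminus A}(\delta)\, g_{B \setminus A}(\theta, \delta).
\end{align*}

Now, finally, we use the exponential-family form. Specifically, we know that

$$
g_B(\theta, t + \delta) = \frac{e^{\langle \theta, t \rangle} e^{\langle \theta, \delta \rangle}}{z_B(\theta)}, \tag{55}
$$

so that $g_A(\theta, t) \propto e^{\langle \theta, t \rangle}$, $g_{B \setminus A}(\theta, \delta) \propto e^{\langle \theta, \delta \rangle}$. Therefore,

$$
\mathbb{P}_{B,\theta}(T_A = t) \propto v_A(t)\, e^{\langle \theta, t \rangle} \propto \mathbb{P}_{A,\theta}(T_A = t), \tag{56}
$$

and normalization now forces

$$
\mathbb{P}_{B,\theta}(T_A = t) = \mathbb{P}_{A,\theta}(T_A = t), \tag{57}
$$

as desired. ■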

Proof of Theorem 3.6. The conditional density of $X_{B \setminus A}$ given $X_A$ is just the ratio of joint to marginal densities
(both with the same $\theta$, by projectibility):

\begin{align}
\frac{p_{B,\theta}(x, y)}{p_{A,\theta}(x)} &= \frac{e^{\langle \theta, t_B(x, y) \rangle} / z_B(\theta)}{e^{\langle \theta, t_A(x) \rangle} / z_A(\theta)} \tag{58} \\
&= \frac{e^{\langle \theta, t_{B \setminus A}(x, y) \rangle}}{z_B(\theta) / z_A(\theta)} \tag{59}
\end{align}

which is an exponential family with parameter $\theta$, sufficient statistic $T_{B \setminus A}$, and partition function
$z_{B \setminus A | A}(\theta) = z_B(\theta) / z_A(\theta)$. ■

Proof of Theorem 4.2. Under equation (14), the cumulant generating function also scales asymptotically,
$\kappa_{A,\theta}(\phi) / r_{|A|} \to a(\theta + \phi) - a(\theta)$. Since $a$ is differentiable, the Gärtner-Ellis theorem of large deviations theory
(den Hollander, 2000, ch. V) implies that $T_A / r_{|A|}$ obeys a large deviations principle with rate $r_{|A|}$, and rate
function given by equation (17), which is to say, equations (15) and (16). ■

Lemma 7.6. The moment generating function of $T_{B \setminus A}$ is

$$
\frac{z_B(\theta + \phi)\, z_A(\theta)}{z_B(\theta)\, z_A(\theta + \phi)} = \frac{M_{\theta,B}(\phi)}{M_{\theta,A}(\phi)} \tag{60}
$$

Proof. From the proof of Theorem 3.6, $X_{B \setminus A} \mid X_A$ has an exponential family distribution with sufficient
statistic $T_{B \setminus A}$. Thus we may use equation (5) to find the moment generating function of $T_{B \setminus A}$ conditional on $X_A$:

\begin{align}
M_{\theta, B \setminus A | A}(\phi) &= \frac{z_{B \setminus A | A}(\theta + \phi)}{z_{B \setminus A | A}(\theta)} \tag{61} \\
&= \frac{z_B(\theta + \phi) / z_A(\theta + \phi)}{z_B(\theta) / z_A(\theta)} \tag{62} \\
&= \frac{z_B(\theta + \phi)\, z_A(\theta)}{z_B(\theta)\, z_A(\theta + \phi)} \tag{63} \\
&= \frac{M_{\theta,B}(\phi)}{M_{\theta,A}(\phi)} \tag{64}
\end{align}

Since, however, $T_{B \setminus A} \perp\!\!\!\perp X_A$ (Proposition 3.4), equation (60) must also give the unconditional moment
generating function. ■

A Non-Uniform Base Measures

For each $A \in \mathcal{A}$, we introduce a finite reference measure $\mu_A$. The exponential family is then given not by Eq.
(2) in SR but by

$$
p_{A,\theta}(x) = \mu_A(x) \frac{e^{\langle \theta, t_A(x) \rangle}}{z_A(\theta)} \tag{65}
$$

with the suitably-modified partition function

$$
z_A(\theta) = \sum_{x \in \mathcal{X}_A} \mu_A(x)\, e^{\langle \theta, t_A(x) \rangle} \tag{66}
$$

The family of base measures need not itself be projective. However, we do require that if $\mu_A(x) > 0$,
then $\mu_B(x \times \mathcal{X}_{B \setminus A}) > 0$. (If this is not so, then some configurations which are allowed for small samples
are forbidden for larger ones, and clearly no exponential family with respect to this $\mu$ could possibly be
projective.) With this condition, the ratio

$$
\mu_{B \setminus A | A}(y \mid x) = \mu_B(x, y) / \mu_A(x) \tag{67}
$$

is well-defined $\mu_A$-almost-everywhere. It defines a finite measure on $\mathcal{X}_{B \setminus A}$.

The use of a non-uniform base measure requires some modifications to the definitions of volume factors.
First, marginal volume factors are given in terms of the base measure:

$$
v_A(t) \equiv \mu_A(\{x : t_A(x) = t\}) = \sum_{x \,:\, t_A(x) = t} \mu_A(x) \tag{68}
$$

The joint volume factor may also be defined directly from the base measure:

$$
v_{A, B \setminus A}(t, \delta) \equiv \mu_B(\{x, y : t_A(x) = t,\; t_{B \setminus A}(x, y) = \delta\}) = \sum_{x, y \,:\, t_A(x) = t,\; t_{B \setminus A}(x, y) = \delta} \mu_B(x, y) \tag{69}
$$

and the conditional volume factor from the ratios of the base measures:

$$
v_{B \setminus A | A}(\delta, x) \equiv \sum_{y \,:\, t_{B \setminus A}(x, y) = \delta} \mu_{B \setminus A | A}(y \mid x) \tag{70}
$$

Note that all of these definitions reduce to the ones given before when $\mu$ is counting measure.

We still say that the sufficient statistics have separable increments when $v_{B \setminus A | A}(\delta, x) = v_{B \setminus A}(\delta)$ for
all $x$. Unfortunately, whether the statistics have separable increments can change with the choice of base
measure.$^{11}$
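
The two-point example given in the accompanying footnote can be verified directly. The sketch below is illustrative: it encodes the increments as 0 and 1 (the footnote only requires that the two values differ), takes the conditional base measure to depend on $y$ alone, and compares the conditional volume factors for the two values of $x$. The dictionary layout, the function names, and the specific weights are our own assumptions.

```python
# Increments t_B(x, y) - t_A(x) for X_A = {a, b}, X_{B\A} = {alpha, beta}.
# (a, alpha) and (b, beta) share one increment; (a, beta) and (b, alpha) the other.
increment = {
    ("a", "alpha"): 0, ("b", "beta"): 0,
    ("a", "beta"): 1, ("b", "alpha"): 1,
}

def conditional_volume(x, mu):
    """v_{B\\A|A}(delta, x): total base-measure weight of y giving each increment."""
    vol = {}
    for (x0, y), delta in increment.items():
        if x0 == x:
            vol[delta] = vol.get(delta, 0.0) + mu[y]
    return vol

def separable(mu):
    """True when the conditional volume factor does not depend on x."""
    return conditional_volume("a", mu) == conditional_volume("b", mu)

uniform = {"alpha": 1.0, "beta": 1.0}   # counting measure
tilted = {"alpha": 2.0, "beta": 1.0}    # hypothetical non-uniform weights

uniform_ok = separable(uniform)
tilted_ok = separable(tilted)
```

Under the counting measure the two conditional volume factors coincide, while under the tilted weights they are swapped between $x = a$ and $x = b$, so separability is lost.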

11 We are grateful to an anonymous referee and to Pavel Krivitsky for independently raising this question. To give a trivial example,
let $\mathcal{X}_A = \{a, b\}$, $\mathcal{X}_{B \setminus A} = \{\alpha, \beta\}$, with $T_B(a, \alpha) - T_A(a) = T_B(b, \beta) - T_A(b)$, while $T_B(a, \beta) - T_A(a) = T_B(b, \alpha) - T_A(b)$,
and these two increments are different from each other. Clearly $T$ has separable increments under the uniform base measure. If
$\mu_{B \setminus A}(\alpha) \neq \mu_{B \setminus A}(\beta)$, then $T$ does not have separable increments.

There needs to be one further restriction on the base measures, which is that, for all $x$ and $x'$ in $\mathcal{X}_A$,

$$
\sum_{y} \mu_{B \setminus A | A}(y \mid x) = \sum_{y} \mu_{B \setminus A | A}(y \mid x'). \tag{71}
$$

This holds when $\mu_B$ is a product measure on $A \times B \setminus A$. (This is the case, for instance, for the model of
Krivitsky et al. (2011).) It also holds if the family of $\mu_A$ is projective, since then $\mu_{B \setminus A | A}$ is a conditional
probability measure and must integrate to 1. We conjecture that Eq. 71 can in fact always be imposed,
through a suitable re-scaling of $\mu$, but have not shown this.

Under Eq. 71 and the modified definitions of the volume factors, most proofs go through as given above.
The two exceptions are as follows.

In the proof of Proposition 4 in SR, the normalizing factor in Eq. (27) in SR, $\sum_{\delta' \in \Delta(x)} v_{B \setminus A | A}(\delta', x)$, is the
same as $\sum_{y \in \mathcal{X}_{B \setminus A}} \mu_{B \setminus A | A}(y \mid x)$. This is not necessarily the cardinality $|\mathcal{X}_{B \setminus A}|$, but so long as it is constant in
$x$, the rest of the proof holds.

In the proof of Lemma 1 in SR, we modify Eq. (31) in SR to read

$$
v_{A, B \setminus A}(t, \delta) = \sum_{x \,:\, t_A(x) = t} \mu_A(x)\, v_{B \setminus A | A}(\delta, x) \tag{72}
$$

which is easily verified from the definitions. The proof now goes through as before.

B From Conditional to Unconditional Projectibility 

Proposition B.1. Suppose that, for each $A \in \mathcal{A}$, $\mathcal{Y}_A = \mathcal{X}_A \times \mathcal{C}_A$, and that $\{P_A\}_{A \in \mathcal{A}}$ is a family of distributions
on $\mathcal{Y}_A$, i.e., of joint distributions of $X_A$ and $C_A$. If (i) the marginal distributions of $\{C_A\}_{A \in \mathcal{A}}$ are projective, (ii)
the conditional distributions $X_A \mid C_A$ are projective (almost always), and (iii) under $P_B$, $X_A$ is independent of
$C_{B \setminus A}$ given $C_A$, then the marginal distribution of $X$ is projective.

Proof. Use the law of total probability to expand $P_B(X_A = x)$:

\begin{align*}
P_B(X_A = x) &= \sum_{c, d, y} P_B(X_A = x, X_{B \setminus A} = y, C_A = c, C_{B \setminus A} = d) \\
&= \sum_{c, d} P_B(C_A = c, C_{B \setminus A} = d) \sum_{y} P_B(X_A = x, X_{B \setminus A} = y \mid C_A = c, C_{B \setminus A} = d) \\
&= \sum_{c, d} P_B(C_A = c, C_{B \setminus A} = d)\, P_A(X_A = x \mid C_A = c)
\end{align*}

by conditions (ii) and (iii). Hence

\begin{align*}
P_B(X_A = x) &= \sum_{c} P_B(C_A = c)\, P_A(X_A = x \mid C_A = c) \sum_{d} P_B(C_{B \setminus A} = d \mid C_A = c) \\
&= \sum_{c} P_B(C_A = c)\, P_A(X_A = x \mid C_A = c) \\
&= P_A(X_A = x)
\end{align*}

since by condition (i), $P_A(C_A = c) = P_B(C_A = c)$. ■

In addition to the application to stochastic block models, we note that this (reassuringly) shows that 
hidden Markov models are projective. 
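
The remark about hidden Markov models can be checked numerically on a toy case. The sketch below is an illustration only: the two-state chain, its initial distribution, and the transition and emission probabilities are arbitrary values chosen for the example, and `sequence_prob` is a hypothetical helper that sums over hidden paths by brute force. It compares the probability of each length-3 observation sequence with the marginal obtained by summing the length-4 model over the final observation.

```python
import itertools

# Hypothetical two-state HMM with binary emissions (illustrative values).
INIT = [0.6, 0.4]                        # P(C_1 = c)
TRANS = [[0.9, 0.1], [0.3, 0.7]]         # P(C_{t+1} = j | C_t = i)
EMIT = [[0.8, 0.2], [0.25, 0.75]]        # P(X_t = x | C_t = c)

def sequence_prob(obs):
    """P(X_1..X_n = obs), summing over all hidden paths by brute force."""
    total = 0.0
    for states in itertools.product(range(2), repeat=len(obs)):
        p = INIT[states[0]] * EMIT[states[0]][obs[0]]
        for prev, cur, x in zip(states, states[1:], obs[1:]):
            p *= TRANS[prev][cur] * EMIT[cur][x]
        total += p
    return total

n = 3
# Largest discrepancy between the length-n law and the marginal of the
# length-(n+1) law over the first n observations.
gap = max(
    abs(sequence_prob(obs) - sum(sequence_prob(obs + (x,)) for x in (0, 1)))
    for obs in itertools.product((0, 1), repeat=n)
)
```

Here the hidden chain plays the role of $C$, its initial segments are projective, and each observation depends only on its own hidden state, so the three conditions of Proposition B.1 hold and the gap is zero up to rounding.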
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